knitr::opts_chunk$set(
message = FALSE,
warning = FALSE
)
The dataset, found on the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29), consists of features computed from digitized images of fine needle aspirates (FNA) of breast masses. The features describe characteristics of the cell nuclei present in the images.
Ten real-valued features are computed for each cell nucleus. The following nuclear features were analyzed:
a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)
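The compactness feature above is a pure function of perimeter and area. A minimal sketch of the formula (the values stored in the dataset are computed per nucleus and may be scaled, so this illustrates the formula rather than reproducing the column):

```r
# Compactness as defined above: perimeter^2 / area - 1.0.
compactness <- function(perimeter, area) {
  perimeter^2 / area - 1.0
}

# For a circle of radius r, perimeter = 2*pi*r and area = pi*r^2,
# so the formula evaluates to 4*pi - 1 (~11.566) regardless of r;
# more irregular contours score higher.
compactness(2 * pi * 5, pi * 5^2)
```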
The dataset has the following attribute information:
(a) Number of instances: 569
(b) Number of attributes: 32
ID
diagnosis: The diagnosis of breast tissues (M = malignant, B = benign) - Class distribution: 357 benign, 212 malignant
fractal_dimension_mean: mean for “coastline approximation” - 1
fractal_dimension_se: standard error for “coastline approximation” - 1
fractal_dimension_worst: “worst” or largest mean value for “coastline approximation”
Breast cancer is the most common invasive cancer in women and affects about 12% of women worldwide (McGuire A, Brown JA, Malone C, McLaughlin R, Kerin MJ (22 May 2015). "Effects of age on the detection and management of breast cancer". Cancers 7(2): 908–29. doi:10.3390/cancers7020815).
The fine needle aspiration (FNA) procedure helps establish the breast cancer diagnosis. Together with physical examination of the breasts and mammography, FNA can be used to diagnose breast cancer with a good degree of accuracy.
A well-described characterization of the cell nuclei from digitized images, including the establishment of patterns/models, can help improve breast cancer diagnosis.
Figures 1, 2 and 3. Digital images from a breast FNA. Sources: M. W. Teague, W. H. Wolberg, W. N. Street, O. L. Mangasarian, S. Labremont, and D. L. Page. "Indeterminate fine needle aspiration of the breast: Image analysis aided diagnosis." Cancer Cytopathology 81: 129–135, 1997; W. N. Street. "Xcyt: A System for Remote Cytological Diagnosis and Prognosis of Breast Cancer." Management Sciences Department, University of Iowa, Iowa City, IA.
Objective: Analyse cell nuclei characteristics and, if possible, identify patterns related to the diagnosis of breast tissues (malignant or benign). Additionally, machine learning models for diagnosis will be proposed.
library(dplyr)
library(ggplot2)
library(tidyverse)
library(formattable)
library(reshape2)
library(pander)
library(ggpubr)
library(ggpmisc)
library(ltm)
library(randomForest)
library(GGally)
library(RColorBrewer)
library(car)
library(corrplot)
library(factoextra)
library(FactoMineR)
library(caret)
library(rpart)
library(rpart.plot)
library(gridExtra)
library(DT)
setwd("C:/Users/bdeta/Documents/R/Projects/2 - Breast Cancer") # Set to the local folder containing data.csv
df <- as.data.frame(read_csv("data.csv"))
df <- subset(df, select = -X33) # Remove the column X33 (NAs)
df$diagnosis <- as.factor(df$diagnosis) # Transform chr to factor
names(df) <- gsub(" ", "_", names(df)) # Fix spaces in column names
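As a sanity check, the class distribution documented above (357 benign, 212 malignant) can be confirmed with `table(df$diagnosis)`. A minimal sketch on a stand-in data frame, since `data.csv` is local to the author's machine:

```r
# Stand-in data frame: only the diagnosis column matters for this check.
df_demo <- data.frame(diagnosis = factor(c("B", "B", "M", "B", "M")))
table(df_demo$diagnosis)
# On the real data, table(df$diagnosis) should report B = 357 and M = 212,
# and nrow(df) should be 569.
```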
summary(df)
## id diagnosis radius_mean texture_mean
## Min. : 8670 B:357 Min. : 6.981 Min. : 9.71
## 1st Qu.: 869218 M:212 1st Qu.:11.700 1st Qu.:16.17
## Median : 906024 Median :13.370 Median :18.84
## Mean : 30371831 Mean :14.127 Mean :19.29
## 3rd Qu.: 8813129 3rd Qu.:15.780 3rd Qu.:21.80
## Max. :911320502 Max. :28.110 Max. :39.28
## perimeter_mean area_mean smoothness_mean compactness_mean
## Min. : 43.79 Min. : 143.5 Min. :0.05263 Min. :0.01938
## 1st Qu.: 75.17 1st Qu.: 420.3 1st Qu.:0.08637 1st Qu.:0.06492
## Median : 86.24 Median : 551.1 Median :0.09587 Median :0.09263
## Mean : 91.97 Mean : 654.9 Mean :0.09636 Mean :0.10434
## 3rd Qu.:104.10 3rd Qu.: 782.7 3rd Qu.:0.10530 3rd Qu.:0.13040
## Max. :188.50 Max. :2501.0 Max. :0.16340 Max. :0.34540
## concavity_mean concave_points_mean symmetry_mean
## Min. :0.00000 Min. :0.00000 Min. :0.1060
## 1st Qu.:0.02956 1st Qu.:0.02031 1st Qu.:0.1619
## Median :0.06154 Median :0.03350 Median :0.1792
## Mean :0.08880 Mean :0.04892 Mean :0.1812
## 3rd Qu.:0.13070 3rd Qu.:0.07400 3rd Qu.:0.1957
## Max. :0.42680 Max. :0.20120 Max. :0.3040
## fractal_dimension_mean radius_se texture_se perimeter_se
## Min. :0.04996 Min. :0.1115 Min. :0.3602 Min. : 0.757
## 1st Qu.:0.05770 1st Qu.:0.2324 1st Qu.:0.8339 1st Qu.: 1.606
## Median :0.06154 Median :0.3242 Median :1.1080 Median : 2.287
## Mean :0.06280 Mean :0.4052 Mean :1.2169 Mean : 2.866
## 3rd Qu.:0.06612 3rd Qu.:0.4789 3rd Qu.:1.4740 3rd Qu.: 3.357
## Max. :0.09744 Max. :2.8730 Max. :4.8850 Max. :21.980
## area_se smoothness_se compactness_se concavity_se
## Min. : 6.802 Min. :0.001713 Min. :0.002252 Min. :0.00000
## 1st Qu.: 17.850 1st Qu.:0.005169 1st Qu.:0.013080 1st Qu.:0.01509
## Median : 24.530 Median :0.006380 Median :0.020450 Median :0.02589
## Mean : 40.337 Mean :0.007041 Mean :0.025478 Mean :0.03189
## 3rd Qu.: 45.190 3rd Qu.:0.008146 3rd Qu.:0.032450 3rd Qu.:0.04205
## Max. :542.200 Max. :0.031130 Max. :0.135400 Max. :0.39600
## concave_points_se symmetry_se fractal_dimension_se
## Min. :0.000000 Min. :0.007882 Min. :0.0008948
## 1st Qu.:0.007638 1st Qu.:0.015160 1st Qu.:0.0022480
## Median :0.010930 Median :0.018730 Median :0.0031870
## Mean :0.011796 Mean :0.020542 Mean :0.0037949
## 3rd Qu.:0.014710 3rd Qu.:0.023480 3rd Qu.:0.0045580
## Max. :0.052790 Max. :0.078950 Max. :0.0298400
## radius_worst texture_worst perimeter_worst area_worst
## Min. : 7.93 Min. :12.02 Min. : 50.41 Min. : 185.2
## 1st Qu.:13.01 1st Qu.:21.08 1st Qu.: 84.11 1st Qu.: 515.3
## Median :14.97 Median :25.41 Median : 97.66 Median : 686.5
## Mean :16.27 Mean :25.68 Mean :107.26 Mean : 880.6
## 3rd Qu.:18.79 3rd Qu.:29.72 3rd Qu.:125.40 3rd Qu.:1084.0
## Max. :36.04 Max. :49.54 Max. :251.20 Max. :4254.0
## smoothness_worst compactness_worst concavity_worst concave_points_worst
## Min. :0.07117 Min. :0.02729 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.11660 1st Qu.:0.14720 1st Qu.:0.1145 1st Qu.:0.06493
## Median :0.13130 Median :0.21190 Median :0.2267 Median :0.09993
## Mean :0.13237 Mean :0.25427 Mean :0.2722 Mean :0.11461
## 3rd Qu.:0.14600 3rd Qu.:0.33910 3rd Qu.:0.3829 3rd Qu.:0.16140
## Max. :0.22260 Max. :1.05800 Max. :1.2520 Max. :0.29100
## symmetry_worst fractal_dimension_worst
## Min. :0.1565 Min. :0.05504
## 1st Qu.:0.2504 1st Qu.:0.07146
## Median :0.2822 Median :0.08004
## Mean :0.2901 Mean :0.08395
## 3rd Qu.:0.3179 3rd Qu.:0.09208
## Max. :0.6638 Max. :0.20750
radius <- df %>%
dplyr::select(c(diagnosis, radius_mean, radius_se, radius_worst)) %>%
group_by(diagnosis) %>%
summarise(Mean_radius_mean = mean(radius_mean), Mean_radius_se = mean(radius_se), Mean_radius_worst = mean(radius_worst))
formattable(radius, list(
diagnosis = formatter("span", style = ~ style(color = "grey",font.weight = "bold")),
Mean_radius_mean = color_tile("#f7d383", "#fec306"),
Mean_radius_se = color_tile("#eb724d", "#df5227"),
Mean_radius_worst = color_tile("#b8ddf2", "#56B4E9")))
| diagnosis | Mean_radius_mean | Mean_radius_se | Mean_radius_worst |
|---|---|---|---|
| B | 12.14652 | 0.2840824 | 13.37980 |
| M | 17.46283 | 0.6090825 | 21.13481 |
The means of the radius variables (mean, se, worst) are higher in the malignant breast cancer group than in the benign group.
test.m <- melt(df,id.vars='diagnosis', measure.vars=c('radius_mean','radius_se','radius_worst'))
ggplot(test.m, aes(x=diagnosis, y=value, fill=variable)) +
geom_boxplot(alpha = 2/3) +
labs(x = 'diagnosis') +
scale_fill_manual(values=c("#fec306", "#df5227", "#56B4E9")) +
scale_color_manual(values=c("#fec306", "#df5227", "#56B4E9")) +
theme_bw() + ggtitle("diagnosis x radius variables") +
theme(plot.title = element_text(hjust = 0.5)) +
facet_grid(~variable) +
geom_jitter(alpha = I(1/4), aes(color = variable)) +
stat_summary(fun = mean, geom = "text", size = 3, vjust = -3, aes(label = round(after_stat(y), digits = 2)))
Higher variability/spread for radius variables (mean, se, worst) was observed in the malignant breast cancer group.
ggplot(test.m, aes(x=value)) +
geom_histogram(binwidth=2, aes(y=after_stat(density)), position="identity", alpha=0.7, color="black") +
geom_density(alpha=0.4, color = NA) +
labs(x = "", y = "Density", title = 'Distribution of the radius variables') + theme_bw() +
aes(fill = variable) +
scale_fill_manual(values = c("#fec306", "#df5227", "#56B4E9")) +
theme(plot.title = element_text(hjust = 0.5)) +
facet_grid(~variable) +
ylim(0, 0.5)
shapiro.tests <- t(as.data.frame(lapply(df[,c("radius_mean", "radius_se", "radius_worst")], function(x) shapiro.test(x)$p.value)))
colnames(shapiro.tests) <- "p-value"
as.data.frame(shapiro.tests)
## p-value
## radius_mean 3.105644e-14
## radius_se 1.224597e-28
## radius_worst 1.704294e-17
Normal distribution verification: The Shapiro–Wilk test and the histogram shapes confirmed that the radius variables (mean, se, worst) are not normally distributed, so we applied a non-parametric test: the unpaired two-sample Wilcoxon test (also known as the Mann–Whitney test).
wilcox.tests <- t(as.data.frame(lapply(df[,c("radius_mean", "radius_se", "radius_worst")], function(x) wilcox.test(x ~ df$diagnosis, conf.level = 0.99)$p.value)))
colnames(wilcox.tests) <- "p-value"
as.data.frame(wilcox.tests)
## p-value
## radius_mean 2.692943e-68
## radius_se 6.217140e-49
## radius_worst 1.135630e-78
Wilcoxon test results: All p-values are < 0.01, so we reject the null hypothesis: there are significant differences in all radius variables (mean, se, worst) between the groups.
The malignant breast cancer group has higher radius values (mean of distances from center to points on the perimeter) than the benign group.
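The check-normality-then-test workflow applied above can be wrapped in a small helper. A sketch under illustrative names (`compare_groups` is not part of the original analysis), shown on simulated skewed data:

```r
# Pick a two-sample test based on a Shapiro-Wilk normality check:
# t-test if x looks normal, otherwise the Wilcoxon (Mann-Whitney) test.
compare_groups <- function(x, g, alpha = 0.05) {
  if (shapiro.test(x)$p.value > alpha) {
    list(test = "t-test", p.value = t.test(x ~ g)$p.value)
  } else {
    list(test = "wilcoxon", p.value = wilcox.test(x ~ g)$p.value)
  }
}

# Skewed (exponential) draws should fail the normality check
# and route to the Wilcoxon branch, as the radius variables did.
set.seed(1)
x <- c(rexp(100, rate = 1), rexp(100, rate = 0.5))
g <- factor(rep(c("B", "M"), each = 100))
compare_groups(x, g)$test
```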
cor.test(df$radius_mean, df$radius_worst)
##
## Pearson's product-moment correlation
##
## data: df$radius_mean and df$radius_worst
## t = 94.255, df = 567, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9641806 0.9741064
## sample estimates:
## cor
## 0.969539
ggplot(df, aes(radius_mean, radius_worst)) +
geom_point(aes(color = diagnosis), size = 1, alpha = 0.4) +
scale_color_manual(values = c("#f69400", "#838383")) +
scale_fill_manual(values = c("#f69400", "#838383")) +
facet_wrap(~diagnosis) +
stat_smooth( aes(color = diagnosis, fill = diagnosis), method = "lm") +
stat_cor(aes(color = diagnosis), label.y = 4.4) +
stat_poly_eq(
aes(color = diagnosis, label = after_stat(eq.label)),
formula = y ~ x, label.y = 4.2, parse = TRUE) +
theme_bw() +
ggtitle("Correlation of radius variables") +
theme(plot.title = element_text(hjust = 0.5))
Correlation analysis: The analysis showed a very strong, positive, and statistically significant correlation (r = 0.969539, p-value < 2.2e-16) between the radius_mean and radius_worst variables.
A point-biserial correlation, which measures the strength and direction of the association between a continuous variable and a binary variable, was carried out to quantify the correlation between the radius features and the diagnosis (benign or malignant).
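The point-biserial r can be cross-checked without `ltm`, since it equals Pearson's correlation with the binary variable coded 0/1 (up to `biserial.cor`'s standard-deviation convention). A sketch with simulated data (means and sample sizes are illustrative, not the real columns):

```r
# Point-biserial correlation is Pearson's r with the group coded 0/1.
set.seed(42)
group <- factor(rep(c("B", "M"), each = 50))
# Simulated feature that runs higher in group M, as radius_mean does.
x <- rnorm(100, mean = ifelse(group == "M", 17, 12), sd = 2)
r_pb <- cor(x, as.numeric(group == "M"))  # coding M = 1, B = 0
r_pb
```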
b1 <- biserial.cor(df$radius_mean, df$diagnosis, level = 2) # Level 2 = the malignant breast cancer group
cat("Correlation value (r): ", b1, "strong")
## Correlation value (r): 0.7300285 strong
b2 <- biserial.cor(df$radius_se, df$diagnosis, level = 2)
cat("Correlation value (r): ", b2, "moderate")
## Correlation value (r): 0.5671338 moderate
b3 <- biserial.cor(df$radius_worst, df$diagnosis, level = 2)
cat("Correlation value (r): ", b3, "strong")
## Correlation value (r): 0.7764538 strong
Identifying extreme values: A commonly used rule (Tukey's rule) flags as outliers (here, extreme values) any observations more than 1.5 times the interquartile range beyond the quartiles, i.e. below Q1 − 1.5·IQR or above Q3 + 1.5·IQR. We quantified these outliers to better characterize the data distribution and to aid interpretation, since extreme values can bias statistical inferences and predictive models.
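Tukey's rule can also be implemented directly rather than via `boxplot(..., plot = FALSE)$out`. A minimal sketch (the function name is illustrative):

```r
# Flag indices of values outside Tukey's fences:
# [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
tukey_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  which(x < q[1] - k * iqr | x > q[2] + k * iqr)
}

# Small worked example: 100 is far outside the fences of 1..10.
tukey_outliers(c(1:10, 100))  # index 11
```

Note that `boxplot()`'s hinges use a slightly different quantile convention than `quantile()`'s default, so counts can differ marginally for values near the fences.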
out_1 <- which(df$radius_mean %in% boxplot(df$radius_mean, plot=FALSE)$out)
n.out_1 <- length(out_1)
cat("Number of Extreme Values:", n.out_1)
## Number of Extreme Values: 14
df[as.numeric(out_1),c("id", "diagnosis", "radius_mean")]
## id diagnosis radius_mean
## 83 8611555 M 25.22
## 109 86355 M 22.27
## 123 865423 M 24.25
## 165 8712289 M 23.27
## 181 873592 M 27.22
## 203 878796 M 23.29
## 213 8810703 M 28.11
## 237 88299702 M 23.21
## 340 89812 M 23.51
## 353 899987 M 25.73
## 370 9012000 M 22.01
## 462 911296202 M 27.42
## 504 915143 M 23.09
## 522 91762702 M 24.63
out_2 <- which(df$radius_se %in% boxplot(df$radius_se, plot=FALSE)$out)
n.out_2 <- length(out_2)
cat("Number of Extreme Values:", n.out_2)
## Number of Extreme Values: 38
df[as.numeric(out_2),c("id", "diagnosis", "radius_se")]
## id diagnosis radius_se
## 1 842302 M 1.0950
## 13 846226 M 0.9555
## 26 852631 M 1.0460
## 28 852781 M 0.8529
## 39 855133 M 1.2140
## 43 855625 M 0.9811
## 78 8610637 M 0.9806
## 79 8610862 M 0.9317
## 83 8611555 M 0.8973
## 109 86355 M 1.2150
## 123 865423 M 1.5090
## 139 868826 M 1.2960
## 162 8711803 M 1.0000
## 169 8712766 M 1.0880
## 211 881046502 M 0.8601
## 213 8810703 M 2.8730
## 219 8811842 M 0.9553
## 237 88299702 M 1.0580
## 251 884948 M 1.0040
## 259 887181 M 1.2920
## 266 88995002 M 1.1720
## 273 8910988 M 1.1670
## 291 89143602 B 0.8811
## 301 892438 M 1.1110
## 303 89263202 M 1.0720
## 340 89812 M 1.0090
## 353 899987 M 0.9948
## 367 9011494 M 0.9761
## 369 9011971 M 1.2070
## 370 9012000 M 1.0080
## 418 90602302 M 1.3700
## 461 911296201 M 0.9291
## 462 911296202 M 2.5470
## 469 9113538 M 0.9289
## 504 915143 M 1.2910
## 522 91762702 M 0.9915
## 564 926125 M 0.9622
## 565 926424 M 1.1760
out_3 <- which(df$radius_worst %in% boxplot(df$radius_worst, plot=FALSE)$out)
n.out_3 <- length(out_3)
cat("Number of Extreme Values:", n.out_3)
## Number of Extreme Values: 17
df[as.numeric(out_3),c("id", "diagnosis", "radius_worst")]
## id diagnosis radius_worst
## 24 851509 M 29.17
## 83 8611555 M 30.00
## 109 86355 M 28.40
## 165 8712289 M 28.01
## 181 873592 M 33.12
## 213 8810703 M 28.11
## 220 88119002 M 27.90
## 237 88299702 M 31.01
## 266 88995002 M 32.49
## 273 8910988 M 28.19
## 340 89812 M 30.67
## 353 899987 M 33.13
## 369 9011971 M 30.75
## 370 9012000 M 27.66
## 462 911296202 M 36.04
## 504 915143 M 30.79
## 522 91762702 M 29.92
texture <- df %>%
dplyr::select(c(diagnosis, texture_mean, texture_se, texture_worst)) %>%
group_by(diagnosis) %>%
summarise(Mean_texture_mean = mean(texture_mean), Mean_texture_se = mean(texture_se), Mean_texture_worst = mean(texture_worst))
formattable(texture, list(
diagnosis = formatter("span", style = ~ style(color = "grey",font.weight = "bold")),
Mean_texture_mean = color_tile("#f7d383", "#fec306"),
Mean_texture_se = color_tile("#eb724d", "#df5227"),
Mean_texture_worst = color_tile("#b8ddf2", "#56B4E9")))
| diagnosis | Mean_texture_mean | Mean_texture_se | Mean_texture_worst |
|---|---|---|---|
| B | 17.91476 | 1.220380 | 23.51507 |
| M | 21.60491 | 1.210915 | 29.31821 |
The means of the texture variables (mean, worst) are higher in the malignant breast cancer group than in the benign group; the texture_se means are essentially identical between groups (1.220 vs. 1.211).
test.m <- melt(df,id.vars='diagnosis', measure.vars=c('texture_mean','texture_se','texture_worst'))
ggplot(test.m, aes(x=diagnosis, y=value, fill=variable)) +
geom_boxplot(alpha = 2/3) +
labs(x = 'diagnosis') +
scale_fill_manual(values=c("#fec306", "#df5227", "#56B4E9")) +
scale_color_manual(values=c("#fec306", "#df5227", "#56B4E9")) +
theme_bw() + ggtitle("diagnosis x texture variables") +
theme(plot.title = element_text(hjust = 0.5)) +
facet_grid(~variable) +
geom_jitter(alpha = I(1/4), aes(color = variable)) +
stat_summary(fun = mean, geom = "text", size = 3, vjust = -3, aes(label = round(after_stat(y), digits = 2)))
The variability/spread for texture variables (mean, se, worst) seems to be similar between the groups.
ggplot(test.m, aes(x=value)) +
geom_histogram(binwidth=2, aes(y=after_stat(density)), position="identity", alpha=0.7, color="black") +
geom_density(alpha=0.4, color = NA) +
labs(x = "", y = "Density", title = 'Distribution of the texture variables') + theme_bw() +
aes(fill = variable) +
scale_fill_manual(values = c("#fec306", "#df5227", "#56B4E9")) +
theme(plot.title = element_text(hjust = 0.5)) +
facet_grid(~variable) +
ylim(0, 0.4)
shapiro.tests <- t(as.data.frame(lapply(df[,c("texture_mean", "texture_se", "texture_worst")], function(x) shapiro.test(x)$p.value)))
colnames(shapiro.tests) <- "p-value"
as.data.frame(shapiro.tests)
## p-value
## texture_mean 7.283581e-08
## texture_se 3.560601e-19
## texture_worst 2.564467e-06
Normal distribution verification: As before, the Shapiro–Wilk test and the histogram shapes confirmed that the texture variables (mean, se, worst) are not normally distributed, so we applied the unpaired two-sample Wilcoxon (Mann–Whitney) test.
wilcox.tests <- t(as.data.frame(lapply(df[,c("texture_mean", "texture_se", "texture_worst")], function(x) wilcox.test(x ~ df$diagnosis, conf.level = 0.99)$p.value)))
colnames(wilcox.tests) <- "p-value"
as.data.frame(wilcox.tests)
## p-value
## texture_mean 3.428627e-28
## texture_se 6.436927e-01
## texture_worst 6.517718e-30
Wilcoxon test results: The p-values are < 0.01 for 2 of the 3 texture variables, so we reject the null hypothesis for those two: there are significant differences in texture_mean and texture_worst between the groups, but not in texture_se (p ≈ 0.64).
The malignant breast cancer group has higher texture values (standard deviation of gray-scale values) than the benign group.
cor.test(df$texture_mean, df$texture_worst)
##
## Pearson's product-moment correlation
##
## data: df$texture_mean and df$texture_worst
## t = 52.957, df = 567, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8971007 0.9249041
## sample estimates:
## cor
## 0.9120446
ggplot(df, aes(texture_mean, texture_worst)) +
geom_point(aes(color = diagnosis), size = 1, alpha = 0.4) +
scale_color_manual(values = c("#f69400", "#838383")) +
scale_fill_manual(values = c("#f69400", "#838383")) +
facet_wrap(~diagnosis) +
stat_smooth( aes(color = diagnosis, fill = diagnosis), method = "lm") +
stat_cor(aes(color = diagnosis), label.y = 4.4) +
stat_poly_eq(
aes(color = diagnosis, label = after_stat(eq.label)),
formula = y ~ x, label.y = 4.2, parse = TRUE) +
theme_bw() +
ggtitle("Correlation of texture variables") +
theme(plot.title = element_text(hjust = 0.5))
Correlation analysis: The analysis showed a very strong, positive, and statistically significant correlation (r = 0.9120446, p-value < 2.2e-16) between the texture_mean and texture_worst variables.
A point-biserial correlation was again used to quantify the association between the texture features and the diagnosis (benign or malignant).
b1 <- biserial.cor(df$texture_mean, df$diagnosis, level = 2) # Level 2 = the malignant breast cancer group
cat("Correlation value (r): ", b1, "moderate")
## Correlation value (r): 0.4151853 moderate
b2 <- biserial.cor(df$texture_se, df$diagnosis, level = 2)
cat("Correlation value (r): ", b2, "very weak")
## Correlation value (r): -0.008303333 very weak
b3 <- biserial.cor(df$texture_worst, df$diagnosis, level = 2)
cat("Correlation value (r): ", b3, "moderate")
## Correlation value (r): 0.4569028 moderate
Identifying extreme values: As with the radius variables, we applied Tukey's rule (values below Q1 − 1.5·IQR or above Q3 + 1.5·IQR) to quantify the extreme values of the texture variables.
out_1 <- which(df$texture_mean %in% boxplot(df$texture_mean, plot=FALSE)$out)
n.out_1 <- length(out_1)
cat("Number of Extreme Values:", n.out_1)
## Number of Extreme Values: 7
df[as.numeric(out_1),c("id", "diagnosis", "texture_mean")]
## id diagnosis texture_mean
## 220 88119002 M 32.47
## 233 88203002 B 33.81
## 240 88330202 M 39.28
## 260 88725602 M 33.56
## 266 88995002 M 31.12
## 456 9112085 B 30.72
## 563 925622 M 30.62
out_2 <- which(df$texture_se %in% boxplot(df$texture_se, plot=FALSE)$out)
n.out_2 <- length(out_2)
cat("Number of Extreme Values:", n.out_2)
## Number of Extreme Values: 20
df[as.numeric(out_2),c("id", "diagnosis", "texture_se")]
## id diagnosis texture_se
## 13 846226 M 3.568
## 84 8611792 M 2.910
## 123 865423 M 3.120
## 137 868223 B 2.508
## 153 8710441 B 2.664
## 193 875099 B 4.885
## 246 884437 B 2.612
## 259 887181 M 2.454
## 315 894047 B 2.777
## 346 898677 B 2.509
## 390 90312 M 2.836
## 417 905978 B 2.878
## 444 909777 B 2.542
## 472 9113816 B 2.643
## 474 9113846 B 3.647
## 529 918192 B 2.635
## 558 925236 B 2.927
## 560 925291 B 2.904
## 562 925311 B 3.896
## 566 926682 M 2.463
out_3 <- which(df$texture_worst %in% boxplot(df$texture_worst, plot=FALSE)$out)
n.out_3 <- length(out_3)
cat("Number of Extreme Values:", n.out_3)
## Number of Extreme Values: 5
df[as.numeric(out_3),c("id", "diagnosis", "texture_worst")]
## id diagnosis texture_worst
## 220 88119002 M 45.41
## 240 88330202 M 44.87
## 260 88725602 M 49.54
## 266 88995002 M 47.16
## 563 925622 M 42.79
perimeter <- df %>%
dplyr::select(c(diagnosis, perimeter_mean, perimeter_se, perimeter_worst)) %>%
group_by(diagnosis) %>%
summarise(Mean_perimeter_mean = mean(perimeter_mean), Mean_perimeter_se = mean(perimeter_se), Mean_perimeter_worst = mean(perimeter_worst))
formattable(perimeter, list(
diagnosis = formatter("span", style = ~ style(color = "grey",font.weight = "bold")),
Mean_perimeter_mean = color_tile("#f7d383", "#fec306"),
Mean_perimeter_se = color_tile("#eb724d", "#df5227"),
Mean_perimeter_worst = color_tile("#b8ddf2", "#56B4E9")))
| diagnosis | Mean_perimeter_mean | Mean_perimeter_se | Mean_perimeter_worst |
|---|---|---|---|
| B | 78.07541 | 2.000321 | 87.00594 |
| M | 115.36538 | 4.323929 | 141.37033 |
The means of the perimeter variables (mean, se, worst) are higher in the malignant breast cancer group than in the benign group.
test.m <- melt(df,id.vars='diagnosis', measure.vars=c('perimeter_mean','perimeter_se','perimeter_worst'))
ggplot(test.m, aes(x=diagnosis, y=value, fill=variable)) +
geom_boxplot(alpha = 2/3) +
labs(x = 'diagnosis') +
scale_fill_manual(values=c("#fec306", "#df5227", "#56B4E9")) +
scale_color_manual(values=c("#fec306", "#df5227", "#56B4E9")) +
theme_bw() + ggtitle("diagnosis x perimeter variables") +
theme(plot.title = element_text(hjust = 0.5)) +
facet_grid(~variable) +
geom_jitter(alpha = I(1/4), aes(color = variable)) +
stat_summary(fun = mean, geom = "text", size = 3, vjust = -3, aes(label = round(after_stat(y), digits = 2)))
Higher variability/spread for perimeter variables (mean, se, worst) was observed in the malignant breast cancer group.
ggplot(test.m, aes(x=value)) +
geom_histogram(binwidth=10, aes(y=after_stat(density)), position="identity", alpha=0.7, color="black") +
geom_density(alpha=0.4, color = NA) +
labs(x = "", y = "Density", title = 'Distribution of the perimeter variables') + theme_bw() +
aes(fill = variable) +
scale_fill_manual(values = c("#fec306", "#df5227", "#56B4E9")) +
theme(plot.title = element_text(hjust = 0.5)) +
facet_grid(~variable) +
ylim(0, 0.15)
shapiro.tests <- t(as.data.frame(lapply(df[,c("perimeter_mean", "perimeter_se", "perimeter_worst")], function(x) shapiro.test(x)$p.value)))
colnames(shapiro.tests) <- "p-value"
as.data.frame(shapiro.tests)
## p-value
## perimeter_mean 7.011402e-15
## perimeter_se 7.587488e-30
## perimeter_worst 1.373336e-17
Normal distribution verification: As before, the Shapiro–Wilk test and the histogram shapes confirmed that the perimeter variables (mean, se, worst) are not normally distributed, so we applied the unpaired two-sample Wilcoxon (Mann–Whitney) test.
wilcox.tests <- t(as.data.frame(lapply(df[,c("perimeter_mean", "perimeter_se", "perimeter_worst")], function(x) wilcox.test(x ~ df$diagnosis, conf.level = 0.99)$p.value)))
colnames(wilcox.tests) <- "p-value"
as.data.frame(wilcox.tests)
## p-value
## perimeter_mean 3.553870e-71
## perimeter_se 5.099437e-51
## perimeter_worst 2.583004e-80
Wilcoxon test results: All p-values are < 0.01, so we reject the null hypothesis: there are significant differences in all perimeter variables (mean, se, worst) between the groups.
The malignant breast cancer group has higher perimeter values than the benign group.
cor.test(df$perimeter_mean, df$perimeter_worst)
##
## Pearson's product-moment correlation
##
## data: df$perimeter_mean and df$perimeter_worst
## t = 95.657, df = 567, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9651750 0.9748288
## sample estimates:
## cor
## 0.9703869
ggplot(df, aes(perimeter_mean, perimeter_worst)) +
geom_point(aes(color = diagnosis), size = 1, alpha = 0.4) +
scale_color_manual(values = c("#f69400", "#838383")) +
scale_fill_manual(values = c("#f69400", "#838383")) +
facet_wrap(~diagnosis) +
stat_smooth( aes(color = diagnosis, fill = diagnosis), method = "lm") +
stat_cor(aes(color = diagnosis), label.y = 4.4) +
stat_poly_eq(
aes(color = diagnosis, label = after_stat(eq.label)),
formula = y ~ x, label.y = 4.2, parse = TRUE) +
theme_bw() +
ggtitle("Correlation of perimeter variables") +
theme(plot.title = element_text(hjust = 0.5))
Correlation analysis: The analysis showed a very strong, positive, and statistically significant correlation (r = 0.9703869, p-value < 2.2e-16) between the perimeter_mean and perimeter_worst variables.
A point-biserial correlation was again used to quantify the association between the perimeter features and the diagnosis (benign or malignant).
b1 <- biserial.cor(df$perimeter_mean, df$diagnosis, level = 2) # Level 2 = the malignant breast cancer group
cat("Correlation value (r): ", b1, "strong")
## Correlation value (r): 0.7426355 strong
b2 <- biserial.cor(df$perimeter_se, df$diagnosis, level = 2)
cat("Correlation value (r): ", b2, "moderate")
## Correlation value (r): 0.5561407 moderate
b3 <- biserial.cor(df$perimeter_worst, df$diagnosis, level = 2)
cat("Correlation value (r): ", b3, "strong")
## Correlation value (r): 0.7829141 strong
Identifying extreme values: As with the previous variables, we applied Tukey's rule (values below Q1 − 1.5·IQR or above Q3 + 1.5·IQR) to quantify the extreme values of the perimeter variables.
out_1 <- which(df$perimeter_mean %in% boxplot(df$perimeter_mean, plot=FALSE)$out)
n.out_1 <- length(out_1)
cat("Number of Extreme Values:", n.out_1)
## Number of Extreme Values: 13
df[as.numeric(out_1),c("id", "diagnosis", "perimeter_mean")]
## id diagnosis perimeter_mean
## 83 8611555 M 171.5
## 109 86355 M 152.8
## 123 865423 M 166.2
## 165 8712289 M 152.1
## 181 873592 M 182.1
## 203 878796 M 158.9
## 213 8810703 M 188.5
## 237 88299702 M 153.5
## 340 89812 M 155.1
## 353 899987 M 174.2
## 462 911296202 M 186.9
## 504 915143 M 152.1
## 522 91762702 M 165.5
out_2 <- which(df$perimeter_se %in% boxplot(df$perimeter_se, plot=FALSE)$out)
n.out_2 <- length(out_2)
cat("Number of Extreme Values:", n.out_2)
## Number of Extreme Values: 38
df[as.numeric(out_2),c("id", "diagnosis", "perimeter_se")]
## id diagnosis perimeter_se
## 1 842302 M 8.589
## 13 846226 M 11.070
## 26 852631 M 7.276
## 39 855133 M 8.077
## 43 855625 M 8.830
## 78 8610637 M 6.311
## 79 8610862 M 8.649
## 83 8611555 M 7.382
## 109 86355 M 10.050
## 123 865423 M 9.807
## 139 868826 M 8.419
## 162 8711803 M 6.971
## 169 8712766 M 7.337
## 211 881046502 M 7.029
## 213 8810703 M 21.980
## 219 8811842 M 6.487
## 237 88299702 M 7.247
## 251 884948 M 6.372
## 257 88649001 M 7.158
## 259 887181 M 10.120
## 263 888570 M 6.146
## 266 88995002 M 7.749
## 273 8910988 M 8.867
## 301 892438 M 7.237
## 303 89263202 M 7.804
## 336 89742801 M 6.076
## 340 89812 M 6.462
## 353 899987 M 7.222
## 367 9011494 M 7.128
## 369 9011971 M 7.733
## 370 9012000 M 7.561
## 418 90602302 M 9.424
## 461 911296201 M 6.051
## 462 911296202 M 18.650
## 504 915143 M 9.635
## 522 91762702 M 7.050
## 564 926125 M 8.758
## 565 926424 M 7.673
out_3 <- which(df$perimeter_worst %in% boxplot(df$perimeter_worst, plot=FALSE)$out)
n.out_3 <- length(out_3)
cat("Number of Extreme Values:", n.out_3)
## Number of Extreme Values: 15
df[as.numeric(out_3),c("id", "diagnosis", "perimeter_worst")]
## id diagnosis perimeter_worst
## 24 851509 M 188.0
## 83 8611555 M 211.7
## 109 86355 M 206.8
## 181 873592 M 220.8
## 213 8810703 M 188.5
## 237 88299702 M 206.0
## 266 88995002 M 214.0
## 273 8910988 M 195.9
## 340 89812 M 202.4
## 353 899987 M 229.3
## 369 9011971 M 199.5
## 370 9012000 M 195.0
## 462 911296202 M 251.2
## 504 915143 M 211.5
## 522 91762702 M 205.7
area <- df %>%
dplyr::select(c(diagnosis, area_mean, area_se, area_worst)) %>%
group_by(diagnosis) %>%
summarise(Mean_area_mean = mean(area_mean), Mean_area_se = mean(area_se), Mean_area_worst = mean(area_worst))
formattable(area, list(
diagnosis = formatter("span", style = ~ style(color = "grey",font.weight = "bold")),
Mean_area_mean = color_tile("#f7d383", "#fec306"),
Mean_area_se = color_tile("#eb724d", "#df5227"),
Mean_area_worst = color_tile("#b8ddf2", "#56B4E9")))
| diagnosis | Mean_area_mean | Mean_area_se | Mean_area_worst |
|---|---|---|---|
| B | 462.7902 | 21.13515 | 558.8994 |
| M | 978.3764 | 72.67241 | 1422.2863 |
The means of the area variables (mean, se, worst) are higher in the malignant breast cancer group than in the benign group.
test.m <- melt(df,id.vars='diagnosis', measure.vars=c('area_mean','area_se','area_worst'))
ggplot(test.m, aes(x=diagnosis, y=value, fill=variable)) +
geom_boxplot(alpha = 2/3) +
labs(x = 'diagnosis') +
scale_fill_manual(values=c("#fec306", "#df5227", "#56B4E9")) +
scale_color_manual(values=c("#fec306", "#df5227", "#56B4E9")) +
theme_bw() + ggtitle("diagnosis x area variables") +
theme(plot.title = element_text(hjust = 0.5)) +
facet_grid(~variable) +
geom_jitter(alpha = I(1/4), aes(color = variable)) +
stat_summary(fun = mean, geom = "text", size = 3, vjust = -3, aes(label = round(after_stat(y), digits = 2)))
Higher variability/spread for area variables (mean, se, worst) was observed in the malignant breast cancer group.
ggplot(test.m, aes(x=value)) +
geom_histogram(binwidth=170, aes(y=after_stat(density)), position="identity", alpha=0.7, color="black") +
geom_density(alpha=0.4, color = NA) +
labs(x = "", y = "Density", title = 'Distribution of the area variables') + theme_bw() +
aes(fill = variable) +
scale_fill_manual(values = c("#fec306", "#df5227", "#56B4E9")) +
theme(plot.title = element_text(hjust = 0.5)) +
facet_grid(~variable) +
ylim(0, 0.015)
shapiro.tests <- t(as.data.frame(lapply(df[,c("area_mean", "area_se", "area_worst")], function(x) shapiro.test(x)$p.value)))
colnames(shapiro.tests) <- "p-value"
as.data.frame(shapiro.tests)
## p-value
## area_mean 3.196264e-22
## area_se 2.652703e-35
## area_worst 5.595364e-25
Normal distribution verification: As before, the Shapiro–Wilk test and the histogram shapes confirmed that the area variables (mean, se, worst) are not normally distributed, so we applied the unpaired two-sample Wilcoxon (Mann–Whitney) test.
wilcox.tests <- t(as.data.frame(lapply(df[,c("area_mean", "area_se", "area_worst")], function(x) wilcox.test(x ~ df$diagnosis, conf.level = 0.99)$p.value)))
colnames(wilcox.tests) <- "p-value"
as.data.frame(wilcox.tests)
## p-value
## area_mean 1.539780e-68
## area_se 5.767823e-65
## area_worst 1.803309e-78
Wilcoxon test results: The p-values are < 0.01, so we reject the null hypothesis: there are significant differences in all area variables (mean, se, worst) between the groups.
The malignant breast cancer group has higher area values than the benign group.
cor.test(df$area_mean, df$area_worst)
##
## Pearson's product-moment correlation
##
## data: df$area_mean and df$area_worst
## t = 80.799, df = 567, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9520827 0.9653017
## sample estimates:
## cor
## 0.9592133
ggplot(df, aes(area_mean, area_worst)) +
geom_point(aes(color = diagnosis), size = 1, alpha = 0.4) +
scale_color_manual(values = c("#f69400", "#838383")) +
scale_fill_manual(values = c("#f69400", "#838383")) +
facet_wrap(~diagnosis) +
stat_smooth( aes(color = diagnosis, fill = diagnosis), method = "lm") +
stat_cor(aes(color = diagnosis), label.y = 4.4) +
stat_poly_eq(
aes(color = diagnosis, label = after_stat(eq.label)),
formula = y ~ x, label.y = 4.2, parse = TRUE) +
theme_bw() +
ggtitle("Correlation of area variables") +
theme(plot.title = element_text(hjust = 0.5))
Correlation analysis: The analysis showed a very strong, positive, and statistically significant correlation (r = 0.9592133, p-value < 2.2e-16) between the area_mean and area_worst variables.
A point-biserial correlation, which measures the strength and direction of the association between a continuous variable and a binary variable, was carried out to verify the correlation between the area feature and the diagnosis (benign or malignant).
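As a sanity check on what this statistic means: the point-biserial correlation is numerically the Pearson correlation computed after coding the binary variable as 0/1 (the `level` argument of `ltm::biserial.cor` controls which group is coded 1, which fixes the sign). A minimal base-R sketch on hypothetical data:

```r
# Point-biserial correlation as a Pearson correlation with 0/1 coding
# (hypothetical data; "M" coded as 1, mirroring level = 2 above)
set.seed(42)
diagnosis <- factor(rep(c("B", "M"), each = 50))
area      <- c(rnorm(50, mean = 500, sd = 80), rnorm(50, mean = 950, sd = 120))

r_pb <- cor(area, as.numeric(diagnosis == "M"))  # point-biserial r
```

Here the simulated groups are well separated, so `r_pb` comes out strongly positive.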
b1 <- biserial.cor(df$area_mean, df$diagnosis, level = 2) # Level 2 = the malignant breast cancer group
cat("Correlation value (r): ", b1, "strong")
## Correlation value (r): 0.7089838 strong
b2 <- biserial.cor(df$area_se, df$diagnosis, level = 2)
cat("Correlation value (r): ", b2, "moderate")
## Correlation value (r): 0.5482359 moderate
b3 <- biserial.cor(df$area_worst, df$diagnosis, level = 2)
cat("Correlation value (r): ", b3, "strong")
## Correlation value (r): 0.733825 strong
Identifying extreme values: A commonly used rule (Tukey’s rule) flags as outliers (extreme values, in this case) the observations more than 1.5 times the interquartile range away from the quartiles, i.e., below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR. We quantified the outliers to better characterize the data distribution and improve the interpretation of the results, since extreme values could bias the statistical inferences and the prediction models.
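Tukey’s rule can be written directly in base R; note that `boxplot(..., plot = FALSE)$out`, used below, applies the same 1.5 × IQR rule, although it computes Tukey’s hinges rather than type-7 quantiles, so borderline cases can differ slightly:

```r
# Tukey's rule: flag values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
tukey_outliers <- function(x) {
  q   <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  x[x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr]
}

vals <- c(10, 12, 11, 13, 12, 11, 14, 100)  # 100 is an obvious extreme value
out  <- tukey_outliers(vals)
```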
out_1 <- which(df$area_mean %in% boxplot(df$area_mean, plot=FALSE)$out)
n.out_1 <- length(out_1)
cat("Number of Extreme Values:", n.out_1)
## Number of Extreme Values: 25
df[as.numeric(out_1),c("id", "diagnosis", "area_mean")]
## id diagnosis area_mean
## 24 851509 M 1404
## 83 8611555 M 1878
## 109 86355 M 1509
## 123 865423 M 1761
## 165 8712289 M 1686
## 181 873592 M 2250
## 203 878796 M 1685
## 213 8810703 M 2499
## 237 88299702 M 1670
## 251 884948 M 1364
## 266 88995002 M 1419
## 273 8910988 M 1491
## 340 89812 M 1747
## 353 899987 M 2010
## 369 9011971 M 1546
## 370 9012000 M 1482
## 373 9012795 M 1386
## 374 901288 M 1335
## 394 903516 M 1407
## 450 911157302 M 1384
## 462 911296202 M 2501
## 504 915143 M 1682
## 522 91762702 M 1841
## 564 926125 M 1347
## 565 926424 M 1479
out_2 <- which(df$area_se %in% boxplot(df$area_se, plot=FALSE)$out)
n.out_2 <- length(out_2)
cat("Number of Extreme Values:", n.out_2)
## Number of Extreme Values: 65
df[as.numeric(out_2),c("id", "diagnosis", "area_se")]
## id diagnosis area_se
## 1 842302 M 153.40
## 3 84300903 M 94.03
## 5 84358402 M 94.44
## 13 846226 M 116.20
## 19 849014 M 112.40
## 24 851509 M 93.99
## 25 852552 M 102.60
## 26 852631 M 111.40
## 28 852781 M 93.54
## 31 853401 M 105.00
## 39 855133 M 106.00
## 43 855625 M 104.90
## 54 857392 M 98.81
## 57 857637 M 102.50
## 71 859575 M 96.05
## 78 8610637 M 134.80
## 79 8610862 M 116.40
## 83 8611555 M 120.00
## 96 86208 M 87.87
## 109 86355 M 170.00
## 122 86517 M 90.47
## 123 865423 M 233.00
## 139 868826 M 101.90
## 157 8711202 M 93.91
## 162 8711803 M 119.30
## 163 871201 M 97.07
## 165 8712289 M 97.85
## 169 8712766 M 122.30
## 181 873592 M 128.70
## 211 881046502 M 111.70
## 213 8810703 M 525.60
## 219 8811842 M 124.40
## 220 88119002 M 109.90
## 237 88299702 M 155.80
## 251 884948 M 137.90
## 253 885429 M 92.81
## 257 88649001 M 106.40
## 259 887181 M 138.50
## 263 888570 M 90.94
## 266 88995002 M 199.70
## 273 8910988 M 156.80
## 301 892438 M 133.00
## 303 89263202 M 130.80
## 336 89742801 M 87.17
## 338 897630 M 88.25
## 340 89812 M 164.10
## 353 899987 M 153.10
## 367 9011494 M 103.60
## 369 9011971 M 224.10
## 370 9012000 M 130.20
## 418 90602302 M 176.50
## 434 908445 M 103.90
## 461 911296201 M 115.20
## 462 911296202 M 542.20
## 469 9113538 M 104.90
## 493 914062 M 89.74
## 499 914769 M 95.77
## 504 915143 M 180.20
## 522 91762702 M 139.90
## 534 91930402 M 100.40
## 536 919555 M 87.78
## 564 926125 M 118.80
## 565 926424 M 158.70
## 566 926682 M 99.04
## 568 927241 M 86.22
out_3 <- which(df$area_worst %in% boxplot(df$area_worst, plot=FALSE)$out)
n.out_3 <- length(out_3)
cat("Number of Extreme Values:", n.out_3)
## Number of Extreme Values: 35
df[as.numeric(out_3),c("id", "diagnosis", "area_worst")]
## id diagnosis area_worst
## 1 842302 M 2019
## 2 842517 M 1956
## 19 849014 M 2398
## 24 851509 M 2615
## 25 852552 M 2215
## 57 857637 M 2145
## 83 8611555 M 2562
## 109 86355 M 2360
## 123 865423 M 2073
## 163 871201 M 2232
## 165 8712289 M 2403
## 181 873592 M 3216
## 182 873593 M 2089
## 203 878796 M 1986
## 213 8810703 M 2499
## 219 8811842 M 2009
## 220 88119002 M 2477
## 237 88299702 M 2944
## 251 884948 M 2010
## 255 886226 M 1972
## 266 88995002 M 3432
## 273 8910988 M 2384
## 301 892438 M 2053
## 324 895100 M 1938
## 340 89812 M 2906
## 353 899987 M 3234
## 369 9011971 M 3143
## 370 9012000 M 2227
## 374 901288 M 1946
## 394 903516 M 2081
## 450 911157302 M 2022
## 462 911296202 M 4254
## 504 915143 M 2782
## 522 91762702 M 2642
## 565 926424 M 2027
smoothness <- df %>%
dplyr::select(c(diagnosis, smoothness_mean, smoothness_se, smoothness_worst)) %>%
group_by(diagnosis) %>%
summarise(Mean_smoothness_mean = mean(smoothness_mean), Mean_smoothness_se = mean(smoothness_se), Mean_smoothness_worst = mean(smoothness_worst))
formattable(smoothness, list(
diagnosis = formatter("span", style = ~ style(color = "grey",font.weight = "bold")),
Mean_smoothness_mean = color_tile("#f7d383", "#fec306"),
Mean_smoothness_se = color_tile("#eb724d", "#df5227"),
Mean_smoothness_worst = color_tile("#b8ddf2", "#56B4E9")))
| diagnosis | Mean_smoothness_mean | Mean_smoothness_se | Mean_smoothness_worst |
|---|---|---|---|
| B | 0.09247765 | 0.007195902 | 0.1249595 |
| M | 0.10289849 | 0.006780094 | 0.1448452 |
The means of the smoothness variables (mean, worst) are higher in the malignant breast cancer group than in the benign group; smoothness_se is the exception, with a slightly lower mean in the malignant group.
test.m <- melt(df,id.vars='diagnosis', measure.vars=c('smoothness_mean','smoothness_se','smoothness_worst'))
ggplot(test.m, aes(x=diagnosis, y=value, fill=variable)) +
geom_boxplot(alpha = 2/3) +
labs(x = 'diagnosis') +
scale_fill_manual(values=c("#fec306", "#df5227", "#56B4E9")) +
scale_color_manual(values=c("#fec306", "#df5227", "#56B4E9")) +
theme_bw() + ggtitle("diagnosis x smoothness variables") +
theme(plot.title = element_text(hjust = 0.5)) +
facet_grid(~variable) +
geom_jitter(alpha = I(1/4), aes(color = variable)) +
stat_summary(fun=mean, geom="text", size=3, vjust=-3, aes(label=round(after_stat(y), digits=2)))
The variability/spread of the smoothness variables (mean, se, worst) appears similar between the groups.
ggplot(test.m, aes(x=value)) +
geom_histogram(binwidth=0.001, aes(y=after_stat(density)), position="identity", alpha=0.7, color="black") +
geom_density(alpha=0.4, color = NA) +
labs(x = "", y = "Density", title = 'Distribution of the smoothness variables') + theme_bw() +
aes(fill = variable) +
scale_fill_manual(values = c("#fec306", "#df5227", "#56B4E9")) +
theme(plot.title = element_text(hjust = 0.5)) +
facet_grid(~variable) +
ylim(0, 0.6)
shapiro.tests <- t(as.data.frame(lapply(df[,c("smoothness_mean", "smoothness_se", "smoothness_worst")], function(x) shapiro.test(x)$p.value)))
colnames(shapiro.tests) <- "p-value"
as.data.frame(shapiro.tests)
## p-value
## smoothness_mean 8.600833e-05
## smoothness_se 1.361967e-23
## smoothness_worst 2.096993e-04
Normal distribution verification: The Shapiro-Wilk test and the shape of the histograms confirmed that the smoothness variables (mean, se, worst) are not normally distributed, so we applied a non-parametric test, the unpaired two-samples Wilcoxon test (also known as the Mann-Whitney test).
wilcox.tests <- t(as.data.frame(lapply(df[,c("smoothness_mean", "smoothness_se", "smoothness_worst")], function(x) wilcox.test(x ~ df$diagnosis, conf.level = 0.99)$p.value)))
colnames(wilcox.tests) <- "p-value"
as.data.frame(wilcox.tests)
## p-value
## smoothness_mean 7.793007e-19
## smoothness_se 2.136316e-01
## smoothness_worst 3.637942e-24
Wilcoxon test results: The p-values are < 0.01 for 2 of the 3 smoothness variables. Hence, we reject the null hypothesis for the mean and worst variables: there are significant differences in the smoothness variables (mean, worst) between the groups, whereas smoothness_se shows no significant difference (p = 0.21).
The malignant breast cancer group has higher smoothness values (local variation in radius lengths) than the benign group.
cor.test(df$smoothness_mean, df$smoothness_worst)
##
## Pearson's product-moment correlation
##
## data: df$smoothness_mean and df$smoothness_worst
## t = 32.347, df = 567, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7743878 0.8324192
## sample estimates:
## cor
## 0.8053242
ggplot(df, aes(smoothness_mean, smoothness_worst)) +
geom_point(aes(color = diagnosis), size = 1, alpha = 0.4) +
scale_color_manual(values = c("#f69400", "#838383")) +
scale_fill_manual(values = c("#f69400", "#838383")) +
facet_wrap(~diagnosis) +
stat_smooth( aes(color = diagnosis, fill = diagnosis), method = "lm") +
stat_cor(aes(color = diagnosis), label.y = 4.4) +
stat_poly_eq(
aes(color = diagnosis, label = after_stat(eq.label)),
formula = y ~ x, label.y = 4.2, parse = TRUE) +
theme_bw() +
ggtitle("Correlation of smoothness variables") +
theme(plot.title = element_text(hjust = 0.5))
Correlation analysis: The analysis showed a strong, positive, and statistically significant correlation (r = 0.8053242, p-value < 2.2e-16) between the smoothness_mean and smoothness_worst variables.
A point-biserial correlation, which measures the strength and direction of the association between a continuous variable and a binary variable, was carried out to verify the correlation between the smoothness feature and the diagnosis (benign or malignant).
b1 <- biserial.cor(df$smoothness_mean, df$diagnosis, level = 2) # Level 2 = the malignant breast cancer group
cat("Correlation value (r): ", b1, "weak")
## Correlation value (r): 0.35856 weak
b2 <- biserial.cor(df$smoothness_se, df$diagnosis, level = 2)
cat("Correlation value (r): ", b2, "very weak")
## Correlation value (r): -0.06701601 very weak
b3 <- biserial.cor(df$smoothness_worst, df$diagnosis, level = 2)
cat("Correlation value (r): ", b3, "moderate")
## Correlation value (r): 0.4214649 moderate
Identifying extreme values: A commonly used rule (Tukey’s rule) flags as outliers (extreme values, in this case) the observations more than 1.5 times the interquartile range away from the quartiles, i.e., below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR. We quantified the outliers to better characterize the data distribution and improve the interpretation of the results, since extreme values could bias the statistical inferences and the prediction models.
out_1 <- which(df$smoothness_mean %in% boxplot(df$smoothness_mean, plot=FALSE)$out)
n.out_1 <- length(out_1)
cat("Number of Extreme Values:", n.out_1)
## Number of Extreme Values: 6
df[as.numeric(out_1),c("id", "diagnosis", "smoothness_mean")]
## id diagnosis smoothness_mean
## 4 84348301 M 0.14250
## 106 863030 M 0.13980
## 123 865423 M 0.14470
## 505 915186 B 0.16340
## 521 917092 B 0.13710
## 569 92751 B 0.05263
out_2 <- which(df$smoothness_se %in% boxplot(df$smoothness_se, plot=FALSE)$out)
n.out_2 <- length(out_2)
cat("Number of Extreme Values:", n.out_2)
## Number of Extreme Values: 30
df[as.numeric(out_2),c("id", "diagnosis", "smoothness_se")]
## id diagnosis smoothness_se
## 72 859711 B 0.01721
## 77 8610629 B 0.01340
## 111 864033 B 0.01385
## 112 86408 B 0.01291
## 117 864726 B 0.01835
## 123 865423 M 0.02333
## 174 871641 B 0.01496
## 177 872608 B 0.01286
## 186 874158 B 0.01439
## 197 875938 M 0.01380
## 213 8810703 M 0.01345
## 214 881094802 M 0.03113
## 246 884437 B 0.01604
## 274 8910996 B 0.01380
## 276 8911164 B 0.01418
## 289 8913049 B 0.01574
## 315 894047 B 0.02075
## 333 897132 B 0.01289
## 346 898677 B 0.01736
## 392 903483 B 0.01582
## 417 905978 B 0.01474
## 425 907145 B 0.01307
## 470 911366 B 0.01459
## 506 915276 B 0.02177
## 508 91544002 B 0.01262
## 521 917092 B 0.01546
## 538 919812 B 0.01288
## 539 921092 B 0.01266
## 540 921362 B 0.01547
## 557 924964 B 0.01291
out_3 <- which(df$smoothness_worst %in% boxplot(df$smoothness_worst, plot=FALSE)$out)
n.out_3 <- length(out_3)
cat("Number of Extreme Values:", n.out_3)
## Number of Extreme Values: 7
df[as.numeric(out_3),c("id", "diagnosis", "smoothness_worst")]
## id diagnosis smoothness_worst
## 4 84348301 M 0.20980
## 42 855563 M 0.19090
## 193 875099 B 0.07117
## 204 87880 M 0.22260
## 380 9013838 M 0.21840
## 505 915186 B 0.19020
## 506 915276 B 0.20060
compactness <- df %>%
dplyr::select(c(diagnosis, compactness_mean, compactness_se, compactness_worst)) %>%
group_by(diagnosis) %>%
summarise(Mean_compactness_mean = mean(compactness_mean), Mean_compactness_se = mean(compactness_se), Mean_compactness_worst = mean(compactness_worst))
formattable(compactness, list(
diagnosis = formatter("span", style = ~ style(color = "grey",font.weight = "bold")),
Mean_compactness_mean = color_tile("#f7d383", "#fec306"),
Mean_compactness_se = color_tile("#eb724d", "#df5227"),
Mean_compactness_worst = color_tile("#b8ddf2", "#56B4E9")))
| diagnosis | Mean_compactness_mean | Mean_compactness_se | Mean_compactness_worst |
|---|---|---|---|
| B | 0.08008462 | 0.02143825 | 0.1826725 |
| M | 0.14518778 | 0.03228117 | 0.3748241 |
The means of the compactness variables (mean, se, worst) are higher in the malignant breast cancer group than in the benign group.
test.m <- melt(df,id.vars='diagnosis', measure.vars=c('compactness_mean','compactness_se','compactness_worst'))
ggplot(test.m, aes(x=diagnosis, y=value, fill=variable)) +
geom_boxplot(alpha = 2/3) +
labs(x = 'diagnosis') +
scale_fill_manual(values=c("#fec306", "#df5227", "#56B4E9")) +
scale_color_manual(values=c("#fec306", "#df5227", "#56B4E9")) +
theme_bw() + ggtitle("diagnosis x compactness variables") +
theme(plot.title = element_text(hjust = 0.5)) +
facet_grid(~variable) +
geom_jitter(alpha = I(1/4), aes(color = variable)) +
stat_summary(fun=mean, geom="text", size=3, vjust=-3, aes(label=round(after_stat(y), digits=2)))
Higher variability/spread for compactness variables (mean, se, worst) was observed in the malignant breast cancer group.
ggplot(test.m, aes(x=value)) +
geom_histogram(binwidth=0.05, aes(y=after_stat(density)), position="identity", alpha=0.7, color="black") +
geom_density(alpha=0.4, color = NA) +
labs(x = "", y = "Density", title = 'Distribution of the compactness variables') + theme_bw() +
aes(fill = variable) +
scale_fill_manual(values = c("#fec306", "#df5227", "#56B4E9")) +
theme(plot.title = element_text(hjust = 0.5)) +
facet_grid(~variable) +
ylim(0, 0.5)
shapiro.tests <- t(as.data.frame(lapply(df[,c("compactness_mean", "compactness_se", "compactness_worst")], function(x) shapiro.test(x)$p.value)))
colnames(shapiro.tests) <- "p-value"
as.data.frame(shapiro.tests)
## p-value
## compactness_mean 3.967204e-17
## compactness_se 1.082957e-23
## compactness_worst 1.247461e-19
Normal distribution verification: The Shapiro-Wilk test and the shape of the histograms confirmed that the compactness variables (mean, se, worst) are not normally distributed, so we applied a non-parametric test, the unpaired two-samples Wilcoxon test (also known as the Mann-Whitney test).
wilcox.tests <- t(as.data.frame(lapply(df[,c("compactness_mean", "compactness_se", "compactness_worst")], function(x) wilcox.test(x ~ df$diagnosis, conf.level = 0.99)$p.value)))
colnames(wilcox.tests) <- "p-value"
as.data.frame(wilcox.tests)
## p-value
## compactness_mean 8.951992e-48
## compactness_se 1.168061e-19
## compactness_worst 2.115525e-47
Wilcoxon test results: The p-values are < 0.01, so we reject the null hypothesis: there are significant differences in all compactness variables (mean, se, worst) between the groups.
The malignant breast cancer group has higher compactness values (perimeter^2 / area - 1.0) than the benign group.
cor.test(df$compactness_mean, df$compactness_worst)
##
## Pearson's product-moment correlation
##
## data: df$compactness_mean and df$compactness_worst
## t = 41.202, df = 567, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8436520 0.8850219
## sample estimates:
## cor
## 0.865809
ggplot(df, aes(compactness_mean, compactness_worst)) +
geom_point(aes(color = diagnosis), size = 1, alpha = 0.4) +
scale_color_manual(values = c("#f69400", "#838383")) +
scale_fill_manual(values = c("#f69400", "#838383")) +
facet_wrap(~diagnosis) +
stat_smooth( aes(color = diagnosis, fill = diagnosis), method = "lm") +
stat_cor(aes(color = diagnosis), label.y = 4.4) +
stat_poly_eq(
aes(color = diagnosis, label = after_stat(eq.label)),
formula = y ~ x, label.y = 4.2, parse = TRUE) +
theme_bw() +
ggtitle("Correlation of compactness variables") +
theme(plot.title = element_text(hjust = 0.5))
Correlation analysis: The analysis showed a very strong, positive, and statistically significant correlation (r = 0.865809, p-value < 2.2e-16) between the compactness_mean and compactness_worst variables.
A point-biserial correlation, which measures the strength and direction of the association between a continuous variable and a binary variable, was carried out to verify the correlation between the compactness feature and the diagnosis (benign or malignant).
b1 <- biserial.cor(df$compactness_mean, df$diagnosis, level = 2) # Level 2 = the malignant breast cancer group
cat("Correlation value (r): ", b1, "moderate")
## Correlation value (r): 0.5965337 moderate
b2 <- biserial.cor(df$compactness_se, df$diagnosis, level = 2)
cat("Correlation value (r): ", b2, "weak")
## Correlation value (r): 0.2929992 weak
b3 <- biserial.cor(df$compactness_worst, df$diagnosis, level = 2)
cat("Correlation value (r): ", b3, "moderate")
## Correlation value (r): 0.5909982 moderate
Identifying extreme values: A commonly used rule (Tukey’s rule) flags as outliers (extreme values, in this case) the observations more than 1.5 times the interquartile range away from the quartiles, i.e., below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR. We quantified the outliers to better characterize the data distribution and improve the interpretation of the results, since extreme values could bias the statistical inferences and the prediction models.
out_1 <- which(df$compactness_mean %in% boxplot(df$compactness_mean, plot=FALSE)$out)
n.out_1 <- length(out_1)
cat("Number of Extreme Values:", n.out_1)
## Number of Extreme Values: 16
df[as.numeric(out_1),c("id", "diagnosis", "compactness_mean")]
## id diagnosis compactness_mean
## 1 842302 M 0.2776
## 4 84348301 M 0.2839
## 10 84501001 M 0.2396
## 13 846226 M 0.2458
## 15 84667401 M 0.2293
## 79 8610862 M 0.3454
## 83 8611555 M 0.2665
## 109 86355 M 0.2768
## 123 865423 M 0.2867
## 182 873593 M 0.2832
## 191 874858 M 0.2413
## 259 887181 M 0.3114
## 352 899667 M 0.2364
## 353 899987 M 0.2363
## 401 90439701 M 0.2576
## 568 927241 M 0.2770
out_2 <- which(df$compactness_se %in% boxplot(df$compactness_se, plot=FALSE)$out)
n.out_2 <- length(out_2)
cat("Number of Extreme Values:", n.out_2)
## Number of Extreme Values: 28
df[as.numeric(out_2),c("id", "diagnosis", "compactness_se")]
## id diagnosis compactness_se
## 4 84348301 M 0.07458
## 10 84501001 M 0.07217
## 13 846226 M 0.08297
## 43 855625 M 0.10060
## 63 858986 M 0.07056
## 69 859471 B 0.08606
## 72 859711 B 0.09368
## 79 8610862 M 0.06835
## 109 86355 M 0.08668
## 113 86409 B 0.07446
## 117 864726 B 0.06760
## 123 865423 M 0.09806
## 153 8710441 B 0.09586
## 177 872608 B 0.08808
## 191 874858 M 0.13540
## 214 881094802 M 0.08555
## 289 8913049 B 0.08262
## 291 89143602 B 0.10640
## 319 894329 B 0.06590
## 352 899667 M 0.06559
## 377 901315 B 0.07643
## 389 903011 B 0.06669
## 431 907914 M 0.06213
## 466 9113239 B 0.06657
## 469 9113538 M 0.07025
## 486 913063 B 0.07471
## 540 921362 B 0.06457
## 568 927241 M 0.06158
out_3 <- which(df$compactness_worst %in% boxplot(df$compactness_worst, plot=FALSE)$out)
n.out_3 <- length(out_3)
cat("Number of Extreme Values:", n.out_3)
## Number of Extreme Values: 16
df[as.numeric(out_3),c("id", "diagnosis", "compactness_worst")]
## id diagnosis compactness_worst
## 1 842302 M 0.6656
## 4 84348301 M 0.8663
## 10 84501001 M 1.0580
## 15 84667401 M 0.7725
## 16 84799002 M 0.6577
## 27 852763 M 0.6643
## 34 854002 M 0.6590
## 43 855625 M 0.7444
## 73 859717 M 0.7394
## 109 86355 M 0.6997
## 182 873593 M 0.7584
## 191 874858 M 0.9327
## 380 9013838 M 0.9379
## 431 907914 M 0.7090
## 563 925622 M 0.7917
## 568 927241 M 0.8681
concavity <- df %>%
dplyr::select(c(diagnosis, concavity_mean, concavity_se, concavity_worst)) %>%
group_by(diagnosis) %>%
summarise(Mean_concavity_mean = mean(concavity_mean), Mean_concavity_se = mean(concavity_se), Mean_concavity_worst = mean(concavity_worst))
formattable(concavity, list(
diagnosis = formatter("span", style = ~ style(color = "grey",font.weight = "bold")),
Mean_concavity_mean = color_tile("#f7d383", "#fec306"),
Mean_concavity_se = color_tile("#eb724d", "#df5227"),
Mean_concavity_worst = color_tile("#b8ddf2", "#56B4E9")))
| diagnosis | Mean_concavity_mean | Mean_concavity_se | Mean_concavity_worst |
|---|---|---|---|
| B | 0.04605762 | 0.02599674 | 0.1662377 |
| M | 0.16077472 | 0.04182401 | 0.4506056 |
The means of the concavity variables (mean, se, worst) are higher in the malignant breast cancer group than in the benign group.
test.m <- melt(df,id.vars='diagnosis', measure.vars=c('concavity_mean','concavity_se','concavity_worst'))
ggplot(test.m, aes(x=diagnosis, y=value, fill=variable)) +
geom_boxplot(alpha = 2/3) +
labs(x = 'diagnosis') +
scale_fill_manual(values=c("#fec306", "#df5227", "#56B4E9")) +
scale_color_manual(values=c("#fec306", "#df5227", "#56B4E9")) +
theme_bw() + ggtitle("diagnosis x concavity variables") +
theme(plot.title = element_text(hjust = 0.5)) +
facet_grid(~variable) +
geom_jitter(alpha = I(1/4), aes(color = variable)) +
stat_summary(fun=mean, geom="text", size=3, vjust=-3, aes(label=round(after_stat(y), digits=2)))
Higher variability/spread for concavity variables (mean, se, worst) was observed in the malignant breast cancer group.
ggplot(test.m, aes(x=value)) +
geom_histogram(binwidth=0.05, aes(y=after_stat(density)), position="identity", alpha=0.7, color="black") +
geom_density(alpha=0.4, color = NA) +
labs(x = "", y = "Density", title = 'Distribution of the concavity variables') + theme_bw() +
aes(fill = variable) +
scale_fill_manual(values = c("#fec306", "#df5227", "#56B4E9")) +
theme(plot.title = element_text(hjust = 0.5)) +
facet_grid(~variable) +
ylim(0, 0.5)
shapiro.tests <- t(as.data.frame(lapply(df[,c("concavity_mean", "concavity_se", "concavity_worst")], function(x) shapiro.test(x)$p.value)))
colnames(shapiro.tests) <- "p-value"
as.data.frame(shapiro.tests)
## p-value
## concavity_mean 1.338571e-21
## concavity_se 1.101681e-31
## concavity_worst 4.543300e-17
Normal distribution verification: The Shapiro-Wilk test and the shape of the histograms confirmed that the concavity variables (mean, se, worst) are not normally distributed, so we applied a non-parametric test, the unpaired two-samples Wilcoxon test (also known as the Mann-Whitney test).
wilcox.tests <- t(as.data.frame(lapply(df[,c("concavity_mean", "concavity_se", "concavity_worst")], function(x) wilcox.test(x ~ df$diagnosis, conf.level = 0.99)$p.value)))
colnames(wilcox.tests) <- "p-value"
as.data.frame(wilcox.tests)
## p-value
## concavity_mean 2.164549e-68
## concavity_se 3.675508e-29
## concavity_worst 1.761723e-63
Wilcoxon test results: The p-values are < 0.01, so we reject the null hypothesis: there are significant differences in all concavity variables (mean, se, worst) between the groups.
The malignant breast cancer group has higher concavity values (severity of concave portions of the contour) than the benign group.
cor.test(df$concavity_mean, df$concavity_worst)
##
## Pearson's product-moment correlation
##
## data: df$concavity_mean and df$concavity_worst
## t = 45.051, df = 567, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8647472 0.9008355
## sample estimates:
## cor
## 0.8841026
ggplot(df, aes(concavity_mean, concavity_worst)) +
geom_point(aes(color = diagnosis), size = 1, alpha = 0.4) +
scale_color_manual(values = c("#f69400", "#838383")) +
scale_fill_manual(values = c("#f69400", "#838383")) +
facet_wrap(~diagnosis) +
stat_smooth( aes(color = diagnosis, fill = diagnosis), method = "lm") +
stat_cor(aes(color = diagnosis), label.y = 4.4) +
stat_poly_eq(
aes(color = diagnosis, label = after_stat(eq.label)),
formula = y ~ x, label.y = 4.2, parse = TRUE) +
theme_bw() +
ggtitle("Correlation of concavity variables") +
theme(plot.title = element_text(hjust = 0.5))
Correlation analysis: The analysis showed a very strong, positive, and statistically significant correlation (r = 0.8841026, p-value < 2.2e-16) between the concavity_mean and concavity_worst variables.
A point-biserial correlation, which measures the strength and direction of the association between a continuous variable and a binary variable, was carried out to verify the correlation between the concavity feature and the diagnosis (benign or malignant).
b1 <- biserial.cor(df$concavity_mean, df$diagnosis, level = 2) # Level 2 = the malignant breast cancer group
cat("Correlation value (r): ", b1, "strong")
## Correlation value (r): 0.6963597 strong
b2 <- biserial.cor(df$concavity_se, df$diagnosis, level = 2)
cat("Correlation value (r): ", b2, "weak")
## Correlation value (r): 0.2537298 weak
b3 <- biserial.cor(df$concavity_worst, df$diagnosis, level = 2)
cat("Correlation value (r): ", b3, "strong")
## Correlation value (r): 0.6596102 strong
Identifying extreme values: A commonly used rule (Tukey’s rule) flags as outliers (extreme values, in this case) the observations more than 1.5 times the interquartile range away from the quartiles, i.e., below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR. We quantified the outliers to better characterize the data distribution and improve the interpretation of the results, since extreme values could bias the statistical inferences and the prediction models.
out_1 <- which(df$concavity_mean %in% boxplot(df$concavity_mean, plot=FALSE)$out)
n.out_1 <- length(out_1)
cat("Number of Extreme Values:", n.out_1)
## Number of Extreme Values: 18
df[as.numeric(out_1),c("id", "diagnosis", "concavity_mean")]
## id diagnosis concavity_mean
## 1 842302 M 0.3001
## 69 859471 B 0.3130
## 79 8610862 M 0.3754
## 83 8611555 M 0.3339
## 109 86355 M 0.4264
## 113 86409 B 0.3003
## 123 865423 M 0.4268
## 153 8710441 B 0.4108
## 181 873592 M 0.2871
## 203 878796 M 0.3523
## 213 8810703 M 0.3201
## 259 887181 M 0.3176
## 352 899667 M 0.2914
## 353 899987 M 0.3368
## 401 90439701 M 0.3189
## 462 911296202 M 0.3635
## 564 926125 M 0.3174
## 568 927241 M 0.3514
out_2 <- which(df$concavity_se %in% boxplot(df$concavity_se, plot=FALSE)$out)
n.out_2 <- length(out_2)
cat("Number of Extreme Values:", n.out_2)
## Number of Extreme Values: 22
df[as.numeric(out_2),c("id", "diagnosis", "concavity_se")]
## id diagnosis concavity_se
## 13 846226 M 0.08890
## 43 855625 M 0.09723
## 69 859471 B 0.30380
## 79 8610862 M 0.10910
## 109 86355 M 0.10400
## 113 86409 B 0.14350
## 117 864726 B 0.09263
## 123 865423 M 0.12780
## 153 8710441 B 0.39600
## 177 872608 B 0.11970
## 191 874858 M 0.11660
## 203 878796 M 0.08958
## 214 881094802 M 0.14380
## 243 883852 B 0.08880
## 251 884948 M 0.09518
## 291 89143602 B 0.09960
## 319 894329 B 0.10270
## 352 899667 M 0.09953
## 377 901315 B 0.15350
## 389 903011 B 0.09472
## 486 913063 B 0.11140
## 540 921362 B 0.09252
out_3 <- which(df$concavity_worst %in% boxplot(df$concavity_worst, plot=FALSE)$out)
n.out_3 <- length(out_3)
cat("Number of Extreme Values:", n.out_3)
## Number of Extreme Values: 12
df[as.numeric(out_3),c("id", "diagnosis", "concavity_worst")]
## id diagnosis concavity_worst
## 10 84501001 M 1.1050
## 69 859471 B 1.2520
## 109 86355 M 0.9608
## 153 8710441 B 0.8216
## 191 874858 M 0.8488
## 203 878796 M 0.7892
## 253 885429 M 0.8489
## 380 9013838 M 0.8402
## 401 90439701 M 0.9034
## 431 907914 M 0.9019
## 563 925622 M 1.1700
## 568 927241 M 0.9387
concave_points <- df %>%
dplyr::select(c(diagnosis, concave_points_mean, concave_points_se, concave_points_worst)) %>%
group_by(diagnosis) %>%
summarise(Mean_concave_points_mean = mean(concave_points_mean), Mean_concave_points_se = mean(concave_points_se), Mean_concave_points_worst = mean(concave_points_worst))
formattable(concave_points, list(
diagnosis = formatter("span", style = ~ style(color = "grey",font.weight = "bold")),
Mean_concave_points_mean = color_tile("#f7d383", "#fec306"),
Mean_concave_points_se = color_tile("#eb724d", "#df5227"),
Mean_concave_points_worst = color_tile("#b8ddf2", "#56B4E9")))
| diagnosis | Mean_concave_points_mean | Mean_concave_points_se | Mean_concave_points_worst |
|---|---|---|---|
| B | 0.02571741 | 0.009857653 | 0.07444434 |
| M | 0.08799000 | 0.015060472 | 0.18223731 |
The means of the concave_points variables (mean, se, worst) are higher in the malignant breast cancer group than in the benign group.
test.m <- melt(df,id.vars='diagnosis', measure.vars=c('concave_points_mean','concave_points_se','concave_points_worst'))
ggplot(test.m, aes(x=diagnosis, y=value, fill=variable)) +
geom_boxplot(alpha = 2/3) +
labs(x = 'diagnosis') +
scale_fill_manual(values=c("#fec306", "#df5227", "#56B4E9")) +
scale_color_manual(values=c("#fec306", "#df5227", "#56B4E9")) +
theme_bw() + ggtitle("diagnosis x concave_points variables") +
theme(plot.title = element_text(hjust = 0.5)) +
facet_grid(~variable) +
geom_jitter(alpha = I(1/4), aes(color = variable)) +
stat_summary(fun=mean, geom="text", size=3, vjust=-3, aes(label=round(after_stat(y), digits=2)))
Higher variability/spread for concave_points variables (mean, se, worst) was observed in the malignant breast cancer group.
ggplot(test.m, aes(x=value)) +
geom_histogram(binwidth=0.02, aes(y=after_stat(density)), position="identity", alpha=0.7, color="black") +
geom_density(alpha=0.4, color = NA) +
labs(x = "", y = "Density", title = 'Distribution of the concave_points variables') + theme_bw() +
aes(fill = variable) +
scale_fill_manual(values = c("#fec306", "#df5227", "#56B4E9")) +
theme(plot.title = element_text(hjust = 0.5)) +
facet_grid(~variable) +
ylim(0, 0.5)
shapiro.tests <- t(as.data.frame(lapply(df[,c("concave_points_mean", "concave_points_se", "concave_points_worst")], function(x) shapiro.test(x)$p.value)))
colnames(shapiro.tests) <- "p-value"
as.data.frame(shapiro.tests)
## p-value
## concave_points_mean 1.404556e-19
## concave_points_se 7.825998e-17
## concave_points_worst 1.984878e-10
Normal distribution verification: Both the Shapiro-Wilk test (p < 0.05) and the shape of the histograms indicate that the concave_points variables (mean, se, worst) are not normally distributed, so we applied a non-parametric test: the unpaired two-samples Wilcoxon test (also known as the Mann-Whitney test).
wilcox.tests <- t(as.data.frame(lapply(df[,c("concave_points_mean", "concave_points_se", "concave_points_worst")], function(x) wilcox.test(x ~ df$diagnosis, conf.level = 0.99)$p.value)))
colnames(wilcox.tests) <- "p-value"
as.data.frame(wilcox.tests)
## p-value
## concave_points_mean 1.006324e-76
## concave_points_se 2.370852e-31
## concave_points_worst 1.863997e-77
Wilcoxon test results: All p-values are < 0.01, so we reject the null hypothesis: there are significant differences between the groups for all concave_points variables (mean, se, worst).
The malignant breast cancer group has higher concave_points values (number of concave portions of the contour) than the benign group.
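A significant p-value says nothing about the size of the group difference. As a supplementary sketch (assuming the `df` data frame from the earlier chunks), the rank-biserial correlation can be derived from the same W statistic that `wilcox.test()` reports:

```r
# Effect size for the Mann-Whitney test: rank-biserial correlation,
# r = 1 - 2W / (n1 * n2), computed from the W statistic.
# Assumes `df` with a two-level `diagnosis` factor (B, M) as above.
rank_biserial <- function(x, g) {
  w <- wilcox.test(x ~ g)$statistic
  n <- table(g)
  unname(1 - 2 * w / (n[1] * n[2]))
}
sapply(df[, c("concave_points_mean", "concave_points_se", "concave_points_worst")],
       rank_biserial, g = df$diagnosis)
```

Values near 1 would indicate that almost every malignant observation outranks every benign one.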
cor.test(df$concave_points_mean, df$concave_points_worst)
##
## Pearson's product-moment correlation
##
## data: df$concave_points_mean and df$concave_points_worst
## t = 52.315, df = 567, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8949081 0.9232799
## sample estimates:
## cor
## 0.9101553
ggplot(df, aes(concave_points_mean, concave_points_worst)) +
geom_point(aes(color = diagnosis), size = 1, alpha = 0.4) +
scale_color_manual(values = c("#f69400", "#838383")) +
scale_fill_manual(values = c("#f69400", "#838383")) +
facet_wrap(~diagnosis) +
stat_smooth( aes(color = diagnosis, fill = diagnosis), method = "lm") +
stat_cor(aes(color = diagnosis), label.y = 4.4) +
stat_poly_eq(
aes(color = diagnosis, label = ..eq.label..),
formula = y ~ x, label.y = 4.2, parse = TRUE) +
theme_bw() +
ggtitle("Correlation of concave_points variables") +
theme(plot.title = element_text(hjust = 0.5))
Correlation analysis: The analysis showed a positive, very strong (r = 0.91), statistically significant (p < 2.2e-16) correlation between the concave_points_mean and concave_points_worst variables.
A point-biserial correlation, used to measure the strength and direction of the association between continuous and binary variables, was carried out in order to verify the correlation between the concave_points feature and the diagnosis (benign or malignant).
b1 <- biserial.cor(df$concave_points_mean, df$diagnosis, level = 2) # Level 2 = the malignant breast cancer group
cat("Correlation value (r): ", b1, "strong")
## Correlation value (r): 0.7766138 strong
b2 <- biserial.cor(df$concave_points_se, df$diagnosis, level = 2)
cat("Correlation value (r): ", b2, "moderate")
## Correlation value (r): 0.4080423 moderate
b3 <- biserial.cor(df$concave_points_worst, df$diagnosis, level = 2)
cat("Correlation value (r): ", b3, "strong")
## Correlation value (r): 0.793566 strong
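A point-biserial correlation is simply a Pearson correlation in which the dichotomous variable is coded 0/1, so the `biserial.cor()` values above can be cross-checked with base R. A minimal sketch, assuming `df$diagnosis` has levels B and M as in the chunks above:

```r
# Point-biserial r is Pearson's r with the binary variable coded 0/1.
# With level = 2 (M), biserial.cor() correlates the feature with malignancy,
# which is what the indicator below reproduces.
malignant <- as.numeric(df$diagnosis == "M")
cor(df$concave_points_mean, malignant)  # should match b1 above
```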
Identifying extreme values: A commonly used rule (Tukey's rule) flags as outliers (extreme values, in this case) the observations more than 1.5 times the interquartile range (IQR) beyond the quartiles, i.e. below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR. We quantified these outliers to better characterize the data distribution and aid interpretation, since extreme values can bias statistical inferences and prediction models.
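The chunks that follow rely on `boxplot()$out` to apply this rule; the fences themselves can be written out explicitly, as in this sketch (assuming `df` from above; note that `boxplot()` uses Tukey's hinges via `fivenum()` rather than `quantile()`, so counts can differ marginally):

```r
# Tukey's rule made explicit: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
tukey_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  fence <- 1.5 * IQR(x)
  which(x < q[1] - fence | x > q[2] + fence)
}
length(tukey_outliers(df$concave_points_mean))
```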
out_1 <- which(df$concave_points_mean %in% boxplot(df$concave_points_mean, plot=FALSE)$out)
n.out_1 <- length(out_1)
cat("Number of Extreme Values:", n.out_1)
## Number of Extreme Values: 10
df[as.numeric(out_1),c("id", "diagnosis", "concave_points_mean")]
## id diagnosis concave_points_mean
## 79 8610862 M 0.1604
## 83 8611555 M 0.1845
## 109 86355 M 0.1823
## 123 865423 M 0.2012
## 181 873592 M 0.1878
## 203 878796 M 0.1620
## 213 8810703 M 0.1595
## 353 899987 M 0.1913
## 394 903516 M 0.1562
## 462 911296202 M 0.1689
out_2 <- which(df$concave_points_se %in% boxplot(df$concave_points_se, plot=FALSE)$out)
n.out_2 <- length(out_2)
cat("Number of Extreme Values:", n.out_2)
## Number of Extreme Values: 19
df[as.numeric(out_2),c("id", "diagnosis", "concave_points_se")]
## id diagnosis concave_points_se
## 13 846226 M 0.04090
## 43 855625 M 0.02638
## 69 859471 B 0.03322
## 79 8610862 M 0.02593
## 139 868826 M 0.02801
## 153 8710441 B 0.05279
## 162 8711803 M 0.02794
## 211 881046502 M 0.02765
## 214 881094802 M 0.03927
## 259 887181 M 0.03024
## 289 8913049 B 0.03487
## 291 89143602 B 0.02771
## 367 9011494 M 0.02536
## 377 901315 B 0.02919
## 390 90312 M 0.03441
## 462 911296202 M 0.02598
## 486 913063 B 0.02721
## 529 918192 B 0.02853
## 564 926125 M 0.02624
out_3 <- which(df$concave_points_worst %in% boxplot(df$concave_points_worst, plot=FALSE)$out)
n.out_3 <- length(out_3)
cat("Number of Extreme Values:", n.out_3)
## Number of Extreme Values: 0
df[as.numeric(out_3),c("id", "diagnosis", "concave_points_worst")]
## [1] id diagnosis concave_points_worst
## <0 rows> (or 0-length row.names)
symmetry <- df %>%
dplyr::select(c(diagnosis, symmetry_mean, symmetry_se, symmetry_worst)) %>%
group_by(diagnosis) %>%
summarise(Mean_symmetry_mean = mean(symmetry_mean), Mean_symmetry_se = mean(symmetry_se), Mean_symmetry_worst = mean(symmetry_worst))
formattable(symmetry, list(
diagnosis = formatter("span", style = ~ style(color = "grey",font.weight = "bold")),
Mean_symmetry_mean = color_tile("#f7d383", "#fec306"),
Mean_symmetry_se = color_tile("#eb724d", "#df5227"),
Mean_symmetry_worst = color_tile("#b8ddf2", "#56B4E9")))
| diagnosis | Mean_symmetry_mean | Mean_symmetry_se | Mean_symmetry_worst |
|---|---|---|---|
| B | 0.174186 | 0.02058381 | 0.2702459 |
| M | 0.192909 | 0.02047240 | 0.3234679 |
The means of the symmetry variables (mean, se, worst) are higher in the malignant breast cancer group than in the benign group.
test.m <- melt(df,id.vars='diagnosis', measure.vars=c('symmetry_mean','symmetry_se','symmetry_worst'))
ggplot(test.m, aes(x=diagnosis, y=value, fill=variable)) +
geom_boxplot(alpha = 2/3) +
labs(x = 'diagnosis') +
scale_fill_manual(values=c("#fec306", "#df5227", "#56B4E9")) +
scale_color_manual(values=c("#fec306", "#df5227", "#56B4E9")) +
theme_bw() + ggtitle("diagnosis x symmetry variables") +
theme(plot.title = element_text(hjust = 0.5)) +
facet_grid(~variable) +
geom_jitter(alpha = I(1/4), aes(color = variable)) +
stat_summary(fun = mean, geom = "text", size = 3, vjust = -3, aes(label = round(after_stat(y), digits = 2)))
Higher variability/spread for symmetry variables (mean, se, worst) was observed in the malignant breast cancer group.
ggplot(test.m, aes(x=value)) +
geom_histogram(binwidth=0.04, aes(y = after_stat(density)), position="identity", alpha=0.7, color="black") +
geom_density(alpha=0.4, color = NA) +
labs(x = "", y = "Density", title = 'Distribution of the symmetry variables') + theme_bw() +
aes(fill = variable) +
scale_fill_manual(values = c("#fec306", "#df5227", "#56B4E9")) +
theme(plot.title = element_text(hjust = 0.5)) +
facet_grid(~variable) +
ylim(0, 0.5)
shapiro.tests <- t(as.data.frame(lapply(df[,c("symmetry_mean", "symmetry_se", "symmetry_worst")], function(x) shapiro.test(x)$p.value)))
colnames(shapiro.tests) <- "p-value"
as.data.frame(shapiro.tests)
## p-value
## symmetry_mean 7.884773e-09
## symmetry_se 3.126807e-24
## symmetry_worst 3.233785e-17
Normal distribution verification: Both the Shapiro-Wilk test (p < 0.05) and the shape of the histograms indicate that the symmetry variables (mean, se, worst) are not normally distributed, so we applied a non-parametric test: the unpaired two-samples Wilcoxon test (also known as the Mann-Whitney test).
wilcox.tests <- t(as.data.frame(lapply(df[,c("symmetry_mean", "symmetry_se", "symmetry_worst")], function(x) wilcox.test(x ~ df$diagnosis, conf.level = 0.99)$p.value)))
colnames(wilcox.tests) <- "p-value"
as.data.frame(wilcox.tests)
## p-value
## symmetry_mean 2.268050e-15
## symmetry_se 2.783664e-02
## symmetry_worst 3.151237e-21
Wilcoxon test results: The p-values are < 0.01 for 2 of the 3 symmetry variables. For those, we reject the null hypothesis: there are significant differences between the groups for the symmetry variables mean and worst.
The malignant breast cancer group has higher symmetry values than the benign group.
cor.test(df$symmetry_mean, df$symmetry_worst)
##
## Pearson's product-moment correlation
##
## data: df$symmetry_mean and df$symmetry_worst
## t = 23.329, df = 567, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6553251 0.7394852
## sample estimates:
## cor
## 0.6998258
ggplot(df, aes(symmetry_mean, symmetry_worst)) +
geom_point(aes(color = diagnosis), size = 1, alpha = 0.4) +
scale_color_manual(values = c("#f69400", "#838383")) +
scale_fill_manual(values = c("#f69400", "#838383")) +
facet_wrap(~diagnosis) +
stat_smooth( aes(color = diagnosis, fill = diagnosis), method = "lm") +
stat_cor(aes(color = diagnosis), label.y = 4.4) +
stat_poly_eq(
aes(color = diagnosis, label = ..eq.label..),
formula = y ~ x, label.y = 4.2, parse = TRUE) +
theme_bw() +
ggtitle("Correlation of symmetry variables") +
theme(plot.title = element_text(hjust = 0.5))
Correlation analysis: The analysis showed a positive, moderate (r = 0.70), statistically significant (p < 2.2e-16) correlation between the symmetry_mean and symmetry_worst variables.
A point-biserial correlation, used to measure the strength and direction of the association between continuous and binary variables, was carried out in order to verify the correlation between the symmetry feature and the diagnosis (benign or malignant).
b1 <- biserial.cor(df$symmetry_mean, df$diagnosis, level = 2) # Level 2 = the malignant breast cancer group
cat("Correlation value (r): ", b1, "weak")
## Correlation value (r): 0.3304986 weak
b2 <- biserial.cor(df$symmetry_se, df$diagnosis, level = 2)
cat("Correlation value (r): ", b2, "very weak")
## Correlation value (r): -0.006521756 very weak
b3 <- biserial.cor(df$symmetry_worst, df$diagnosis, level = 2)
cat("Correlation value (r): ", b3, "moderate")
## Correlation value (r): 0.4162943 moderate
Identifying extreme values: A commonly used rule (Tukey's rule) flags as outliers (extreme values, in this case) the observations more than 1.5 times the interquartile range (IQR) beyond the quartiles, i.e. below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR. We quantified these outliers to better characterize the data distribution and aid interpretation, since extreme values can bias statistical inferences and prediction models.
out_1 <- which(df$symmetry_mean %in% boxplot(df$symmetry_mean, plot=FALSE)$out)
n.out_1 <- length(out_1)
cat("Number of Extreme Values:", n.out_1)
## Number of Extreme Values: 15
df[as.numeric(out_1),c("id", "diagnosis", "symmetry_mean")]
## id diagnosis symmetry_mean
## 4 84348301 M 0.2597
## 23 8511133 M 0.2521
## 26 852631 M 0.3040
## 61 858970 B 0.2743
## 79 8610862 M 0.2906
## 109 86355 M 0.2556
## 123 865423 M 0.2655
## 147 869691 M 0.2678
## 151 871001501 B 0.2540
## 153 8710441 B 0.2548
## 259 887181 M 0.2495
## 289 8913049 B 0.2595
## 324 895100 M 0.2569
## 425 907145 B 0.2538
## 562 925311 B 0.1060
out_2 <- which(df$symmetry_se %in% boxplot(df$symmetry_se, plot=FALSE)$out)
n.out_2 <- length(out_2)
cat("Number of Extreme Values:", n.out_2)
## Number of Extreme Values: 27
df[as.numeric(out_2),c("id", "diagnosis", "symmetry_se")]
## id diagnosis symmetry_se
## 4 84348301 M 0.05963
## 13 846226 M 0.04484
## 23 8511133 M 0.03672
## 43 855625 M 0.05333
## 61 858970 B 0.04183
## 64 859196 B 0.04192
## 69 859471 B 0.04197
## 79 8610862 M 0.07895
## 120 865128 M 0.05014
## 123 865423 M 0.04547
## 139 868826 M 0.05168
## 147 869691 M 0.05628
## 177 872608 B 0.03880
## 191 874858 M 0.05113
## 193 875099 B 0.03799
## 213 8810703 M 0.04783
## 215 8810955 M 0.04499
## 291 89143602 B 0.04077
## 315 894047 B 0.06146
## 330 895633 M 0.04022
## 333 897132 B 0.04243
## 344 898431 M 0.03756
## 346 898677 B 0.03675
## 352 899667 M 0.05543
## 367 9011494 M 0.03710
## 521 917092 B 0.03997
## 554 924342 B 0.03759
out_3 <- which(df$symmetry_worst %in% boxplot(df$symmetry_worst, plot=FALSE)$out)
n.out_3 <- length(out_3)
cat("Number of Extreme Values:", n.out_3)
## Number of Extreme Values: 23
df[as.numeric(out_3),c("id", "diagnosis", "symmetry_worst")]
## id diagnosis symmetry_worst
## 1 842302 M 0.4601
## 4 84348301 M 0.6638
## 9 844981 M 0.4378
## 10 84501001 M 0.4366
## 16 84799002 M 0.4218
## 23 8511133 M 0.4667
## 27 852763 M 0.4264
## 32 853612 M 0.4761
## 35 854039 M 0.4270
## 36 854253 M 0.4863
## 43 855625 M 0.4670
## 69 859471 B 0.4228
## 79 8610862 M 0.5440
## 120 865128 M 0.4882
## 147 869691 M 0.5774
## 191 874858 M 0.5166
## 200 877500 M 0.4753
## 204 87880 M 0.4432
## 215 8810955 M 0.4724
## 324 895100 M 0.5558
## 352 899667 M 0.4245
## 371 9012315 M 0.4824
## 490 913535 M 0.4677
fractal_dimension <- df %>%
dplyr::select(c(diagnosis, fractal_dimension_mean, fractal_dimension_se, fractal_dimension_worst)) %>%
group_by(diagnosis) %>%
summarise(Mean_fractal_dimension_mean = mean(fractal_dimension_mean), Mean_fractal_dimension_se = mean(fractal_dimension_se), Mean_fractal_dimension_worst = mean(fractal_dimension_worst))
formattable(fractal_dimension, list(
diagnosis = formatter("span", style = ~ style(color = "grey",font.weight = "bold")),
Mean_fractal_dimension_mean = color_tile("#f7d383", "#fec306"),
Mean_fractal_dimension_se = color_tile("#eb724d", "#df5227"),
Mean_fractal_dimension_worst = color_tile("#b8ddf2", "#56B4E9")))
| diagnosis | Mean_fractal_dimension_mean | Mean_fractal_dimension_se | Mean_fractal_dimension_worst |
|---|---|---|---|
| B | 0.06286739 | 0.003636051 | 0.07944207 |
| M | 0.06268009 | 0.004062406 | 0.09152995 |
The means of the fractal_dimension variables se and worst are higher in the malignant breast cancer group than in the benign group; the mean of fractal_dimension_mean is similar in both groups.
test.m <- melt(df,id.vars='diagnosis', measure.vars=c('fractal_dimension_mean','fractal_dimension_se','fractal_dimension_worst'))
ggplot(test.m, aes(x=diagnosis, y=value, fill=variable)) +
geom_boxplot(alpha = 2/3) +
labs(x = 'diagnosis') +
scale_fill_manual(values=c("#fec306", "#df5227", "#56B4E9")) +
scale_color_manual(values=c("#fec306", "#df5227", "#56B4E9")) +
theme_bw() + ggtitle("diagnosis x fractal_dimension variables") +
theme(plot.title = element_text(hjust = 0.5)) +
facet_grid(~variable) +
geom_jitter(alpha = I(1/4), aes(color = variable)) +
stat_summary(fun = mean, geom = "text", size = 3, vjust = -3, aes(label = round(after_stat(y), digits = 2)))
Higher variability/spread for fractal_dimension variables (mean, se, worst) was observed in the malignant breast cancer group.
ggplot(test.m, aes(x=value)) +
geom_histogram(binwidth=0.02, aes(y = after_stat(density)), position="identity", alpha=0.7, color="black") +
geom_density(alpha=0.4, color = NA) +
labs(x = "", y = "Density", title = 'Distribution of the fractal_dimension variables') + theme_bw() +
aes(fill = variable) +
scale_fill_manual(values = c("#fec306", "#df5227", "#56B4E9")) +
theme(plot.title = element_text(hjust = 0.5)) +
facet_grid(~variable) +
ylim(0, 0.5)
shapiro.tests <- t(as.data.frame(lapply(df[,c("fractal_dimension_mean", "fractal_dimension_se", "fractal_dimension_worst")], function(x) shapiro.test(x)$p.value)))
colnames(shapiro.tests) <- "p-value"
as.data.frame(shapiro.tests)
## p-value
## fractal_dimension_mean 1.956575e-16
## fractal_dimension_se 8.551018e-31
## fractal_dimension_worst 9.195146e-20
Normal distribution verification: Both the Shapiro-Wilk test (p < 0.05) and the shape of the histograms indicate that the fractal_dimension variables (mean, se, worst) are not normally distributed, so we applied a non-parametric test: the unpaired two-samples Wilcoxon test (also known as the Mann-Whitney test).
wilcox.tests <- t(as.data.frame(lapply(df[,c("fractal_dimension_mean", "fractal_dimension_se", "fractal_dimension_worst")], function(x) wilcox.test(x ~ df$diagnosis, conf.level = 0.99)$p.value)))
colnames(wilcox.tests) <- "p-value"
as.data.frame(wilcox.tests)
## p-value
## fractal_dimension_mean 5.371856e-01
## fractal_dimension_se 1.572165e-06
## fractal_dimension_worst 1.144240e-13
Wilcoxon test results: The p-values are < 0.01 for 2 of the 3 fractal_dimension variables. For those, we reject the null hypothesis: there are significant differences between the groups for the fractal_dimension variables se and worst.
The malignant breast cancer group has higher fractal_dimension values ("coastline approximation" − 1) than the benign group only for fractal_dimension_se and fractal_dimension_worst.
cor.test(df$fractal_dimension_mean, df$fractal_dimension_worst)
##
## Pearson's product-moment correlation
##
## data: df$fractal_dimension_mean and df$fractal_dimension_worst
## t = 28.49, df = 567, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7312170 0.7990954
## sample estimates:
## cor
## 0.7672968
ggplot(df, aes(fractal_dimension_mean, fractal_dimension_worst)) +
geom_point(aes(color = diagnosis), size = 1, alpha = 0.4) +
scale_color_manual(values = c("#f69400", "#838383")) +
scale_fill_manual(values = c("#f69400", "#838383")) +
facet_wrap(~diagnosis) +
stat_smooth( aes(color = diagnosis, fill = diagnosis), method = "lm") +
stat_cor(aes(color = diagnosis), label.y = 4.4) +
stat_poly_eq(
aes(color = diagnosis, label = ..eq.label..),
formula = y ~ x, label.y = 4.2, parse = TRUE) +
theme_bw() +
ggtitle("Correlation of fractal_dimension variables") +
theme(plot.title = element_text(hjust = 0.5))
Correlation analysis: The analysis showed a positive, strong (r = 0.77), statistically significant (p < 2.2e-16) correlation between the fractal_dimension_mean and fractal_dimension_worst variables.
A point-biserial correlation, used to measure the strength and direction of the association between continuous and binary variables, was carried out in order to verify the correlation between the fractal_dimension feature and the diagnosis (benign or malignant).
b1 <- biserial.cor(df$fractal_dimension_mean, df$diagnosis, level = 2) # Level 2 = the malignant breast cancer group
cat("Correlation value (r): ", b1, "very weak")
## Correlation value (r): -0.0128376 very weak
b2 <- biserial.cor(df$fractal_dimension_se, df$diagnosis, level = 2)
cat("Correlation value (r): ", b2, "very weak")
## Correlation value (r): 0.07797242 very weak
b3 <- biserial.cor(df$fractal_dimension_worst, df$diagnosis, level = 2)
cat("Correlation value (r): ", b3, "moderate")
## Correlation value (r): 0.3238722 moderate
Identifying extreme values: A commonly used rule (Tukey's rule) flags as outliers (extreme values, in this case) the observations more than 1.5 times the interquartile range (IQR) beyond the quartiles, i.e. below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR. We quantified these outliers to better characterize the data distribution and aid interpretation, since extreme values can bias statistical inferences and prediction models.
out_1 <- which(df$fractal_dimension_mean %in% boxplot(df$fractal_dimension_mean, plot=FALSE)$out)
n.out_1 <- length(out_1)
cat("Number of Extreme Values:", n.out_1)
## Number of Extreme Values: 15
df[as.numeric(out_1),c("id", "diagnosis", "fractal_dimension_mean")]
## id diagnosis fractal_dimension_mean
## 4 84348301 M 0.09744
## 10 84501001 M 0.08243
## 69 859471 B 0.08046
## 72 859711 B 0.08980
## 79 8610862 M 0.08142
## 152 871001502 B 0.08261
## 153 8710441 B 0.09296
## 177 872608 B 0.08116
## 259 887181 M 0.08104
## 319 894329 B 0.08743
## 377 901315 B 0.08450
## 380 9013838 M 0.07950
## 505 915186 B 0.09502
## 506 915276 B 0.09575
## 508 91544002 B 0.07976
out_2 <- which(df$fractal_dimension_se %in% boxplot(df$fractal_dimension_se, plot=FALSE)$out)
n.out_2 <- length(out_2)
cat("Number of Extreme Values:", n.out_2)
## Number of Extreme Values: 28
df[as.numeric(out_2),c("id", "diagnosis", "fractal_dimension_se")]
## id diagnosis fractal_dimension_se
## 4 84348301 M 0.009208
## 10 84501001 M 0.010080
## 13 846226 M 0.012840
## 15 84667401 M 0.008093
## 69 859471 B 0.009559
## 72 859711 B 0.021930
## 84 8611792 M 0.010390
## 113 86409 B 0.012980
## 123 865423 M 0.009875
## 146 869476 B 0.009423
## 148 86973701 B 0.009368
## 152 871001502 B 0.011780
## 153 8710441 B 0.029840
## 177 872608 B 0.017920
## 191 874858 M 0.011720
## 214 881094802 M 0.012560
## 243 883852 B 0.008675
## 258 886776 M 0.008660
## 291 89143602 B 0.022860
## 377 901315 B 0.012200
## 389 903011 B 0.012330
## 451 9111596 B 0.008925
## 466 9113239 B 0.008133
## 469 9113538 M 0.011300
## 486 913063 B 0.009627
## 505 915186 B 0.010450
## 506 915276 B 0.011480
## 508 91544002 B 0.008313
out_3 <- which(df$fractal_dimension_worst %in% boxplot(df$fractal_dimension_worst, plot=FALSE)$out)
n.out_3 <- length(out_3)
cat("Number of Extreme Values:", n.out_3)
## Number of Extreme Values: 24
df[as.numeric(out_3),c("id", "diagnosis", "fractal_dimension_worst")]
## id diagnosis fractal_dimension_worst
## 4 84348301 M 0.1730
## 6 843786 M 0.1244
## 10 84501001 M 0.2075
## 15 84667401 M 0.1431
## 16 84799002 M 0.1341
## 27 852763 M 0.1275
## 32 853612 M 0.1402
## 35 854039 M 0.1233
## 73 859717 M 0.1339
## 106 863030 M 0.1405
## 119 864877 M 0.1252
## 152 871001502 B 0.1486
## 153 8710441 B 0.1259
## 182 873593 M 0.1284
## 191 874858 M 0.1446
## 230 881861 M 0.1243
## 243 883852 B 0.1297
## 253 885429 M 0.1297
## 380 9013838 M 0.1403
## 466 9113239 B 0.1249
## 505 915186 B 0.1252
## 506 915276 B 0.1364
## 563 925622 M 0.1409
## 568 927241 M 0.1240
Correlation analysis across all 30 features:
df.n <- subset(df, select = -c(id, diagnosis))
corrplot(cor(df.n), type="lower", number.cex = .35, addCoef.col = "black", tl.col = "black", tl.srt = 90, tl.cex = .5, col=brewer.pal(n=8, name="RdBu"), order = "FPC")
The strongest correlation values (0.80-0.999) are shown below:
cor.sig <- as.data.frame(as.table(cor(df.n)))
cor.sig <- subset(cor.sig, c(abs(Freq) > 0.8 & abs(Freq) != 1))
cor.sig %<>% distinct(Freq, .keep_all = TRUE)
colnames(cor.sig) <- c("Variables_1", "Variables_2", "Correlation Value")
cor.sig[order(-cor.sig$'Correlation Value'),]
## Variables_1 Variables_2 Correlation Value
## 1 perimeter_mean radius_mean 0.9978553
## 37 perimeter_worst radius_worst 0.9937079
## 2 area_mean radius_mean 0.9873572
## 8 area_mean perimeter_mean 0.9865068
## 38 area_worst radius_worst 0.9840146
## 39 area_worst perimeter_worst 0.9775781
## 31 perimeter_se radius_se 0.9727937
## 11 perimeter_worst perimeter_mean 0.9703869
## 4 radius_worst radius_mean 0.9695390
## 10 radius_worst perimeter_mean 0.9694764
## 5 perimeter_worst radius_mean 0.9651365
## 15 radius_worst area_mean 0.9627461
## 17 area_worst area_mean 0.9592133
## 16 perimeter_worst area_mean 0.9591196
## 32 area_se radius_se 0.9518301
## 12 area_worst perimeter_mean 0.9415498
## 6 area_worst radius_mean 0.9410825
## 33 area_se perimeter_se 0.9376554
## 24 concave_points_mean concavity_mean 0.9213910
## 7 texture_worst texture_mean 0.9120446
## 30 concave_points_worst concave_points_mean 0.9101553
## 41 concavity_worst compactness_worst 0.8922609
## 25 concavity_worst concavity_mean 0.8841026
## 19 concavity_mean compactness_mean 0.8831207
## 21 compactness_worst compactness_mean 0.8658090
## 26 concave_points_worst concavity_mean 0.8613230
## 28 perimeter_worst concave_points_mean 0.8559231
## 44 concave_points_worst concavity_worst 0.8554339
## 9 concave_points_mean perimeter_mean 0.8509770
## 20 concave_points_mean compactness_mean 0.8311350
## 27 radius_worst concave_points_mean 0.8303176
## 13 concave_points_mean area_mean 0.8232689
## 3 concave_points_mean radius_mean 0.8225285
## 40 concave_points_worst perimeter_worst 0.8163221
## 22 concavity_worst compactness_mean 0.8162752
## 23 concave_points_worst compactness_mean 0.8155732
## 34 area_worst area_se 0.8114080
## 43 fractal_dimension_worst compactness_worst 0.8104549
## 29 area_worst concave_points_mean 0.8096296
## 18 smoothness_worst smoothness_mean 0.8053242
## 36 fractal_dimension_se compactness_se 0.8032688
## 35 concavity_se compactness_se 0.8012683
## 42 concave_points_worst compactness_worst 0.8010804
## 14 area_se area_mean 0.8000859
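One caveat on the chunk above: de-duplicating with `distinct(Freq, ...)` drops the mirrored A-B/B-A entries by matching on the correlation value itself, which would also silently drop any distinct pair that happened to share an identical value. A more robust sketch (assuming the `df.n` numeric data frame from above) keeps each pair exactly once by masking the lower triangle:

```r
# Keep each variable pair once: blank out the diagonal and lower triangle
# of the correlation matrix before flattening, instead of de-duplicating
# on the correlation value.
cm <- cor(df.n)
cm[lower.tri(cm, diag = TRUE)] <- NA
cor.pairs <- subset(as.data.frame(as.table(cm)), !is.na(Freq) & abs(Freq) > 0.8)
cor.pairs[order(-abs(cor.pairs$Freq)), ]
```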
EDA Results:
For the variables (mean, se, worst) whose means were higher in the malignant breast cancer group, the group differences reached statistical significance (p < 0.01).
The variables showing the most statistically significant differences between the malignant and benign breast cancer groups were:
Across the 480 correlations computed between all 30 features, 44 (9.08%) showed a very strong correlation value (|r| > 0.8).
In the point-biserial correlation analysis, the following variables showed a strong correlation with the diagnosis variable (malignant or benign):
Principal component analysis (PCA) is a data-reduction technique that transforms a larger number of correlated variables into a smaller set of uncorrelated variables called principal components (PCs) or dimensions. We think PCA could improve the analysis of this dataset, whose 30 variables are highly correlated.
PCA is a useful pre-processing tool for picking out the most relevant linear combinations of variables and using them in prediction models.
Its main drawback is that it generates the principal components in an unsupervised manner, without looking at the target vector. In addition, the predictors generally become harder to interpret, since each principal component is a combination of the original features.
df <- subset(df, select = -id)
df.v <- subset(df, select = -diagnosis)
df.d <- subset(df, select = diagnosis)
# Apply PCA
df.pca <- PCA(df.v, scale.unit = TRUE, graph = FALSE)
summary(df.pca)
##
## Call:
## PCA(X = df.v, scale.unit = TRUE, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 13.282 5.691 2.818 1.981 1.649 1.207
## % of var. 44.272 18.971 9.393 6.602 5.496 4.025
## Cumulative % of var. 44.272 63.243 72.636 79.239 84.734 88.759
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11 Dim.12
## Variance 0.675 0.477 0.417 0.351 0.294 0.261
## % of var. 2.251 1.589 1.390 1.169 0.980 0.871
## Cumulative % of var. 91.010 92.598 93.988 95.157 96.137 97.007
## Dim.13 Dim.14 Dim.15 Dim.16 Dim.17 Dim.18
## Variance 0.241 0.157 0.094 0.080 0.059 0.053
## % of var. 0.805 0.523 0.314 0.266 0.198 0.175
## Cumulative % of var. 97.812 98.335 98.649 98.915 99.113 99.288
## Dim.19 Dim.20 Dim.21 Dim.22 Dim.23 Dim.24
## Variance 0.049 0.031 0.030 0.027 0.024 0.018
## % of var. 0.165 0.104 0.100 0.091 0.081 0.060
## Cumulative % of var. 99.453 99.557 99.657 99.749 99.830 99.890
## Dim.25 Dim.26 Dim.27 Dim.28 Dim.29 Dim.30
## Variance 0.015 0.008 0.007 0.002 0.001 0.000
## % of var. 0.052 0.027 0.023 0.005 0.002 0.000
## Cumulative % of var. 99.942 99.969 99.992 99.997 100.000 100.000
##
## Individuals (the 10 first)
## Dist Dim.1 ctr cos2 Dim.2 ctr
## 1 | 10.710 | 9.193 1.118 0.737 | 1.949 0.117
## 2 | 5.132 | 2.388 0.075 0.216 | -3.768 0.438
## 3 | 6.119 | 5.734 0.435 0.878 | -1.075 0.036
## 4 | 13.986 | 7.123 0.671 0.259 | 10.276 3.261
## 5 | 5.868 | 3.935 0.205 0.450 | -1.948 0.117
## 6 | 5.735 | 2.380 0.075 0.172 | 3.950 0.482
## 7 | 3.970 | 2.239 0.066 0.318 | -2.690 0.223
## 8 | 4.195 | 2.143 0.061 0.261 | 2.340 0.169
## 9 | 6.017 | 3.175 0.133 0.278 | 3.392 0.355
## 10 | 12.163 | 6.352 0.534 0.273 | 7.727 1.844
## cos2 Dim.3 ctr cos2
## 1 0.033 | -1.123 0.079 0.011 |
## 2 0.539 | -0.529 0.017 0.011 |
## 3 0.031 | -0.552 0.019 0.008 |
## 4 0.540 | -3.233 0.652 0.053 |
## 5 0.110 | 1.390 0.120 0.056 |
## 6 0.474 | -2.935 0.537 0.262 |
## 7 0.459 | -1.640 0.168 0.171 |
## 8 0.311 | -0.872 0.047 0.043 |
## 9 0.318 | -3.120 0.607 0.269 |
## 10 0.404 | -4.342 1.176 0.127 |
##
## Variables (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2
## radius_mean | 0.798 4.792 0.636 | -0.558 5.469 0.311 |
## texture_mean | 0.378 1.076 0.143 | -0.142 0.356 0.020 |
## perimeter_mean | 0.829 5.177 0.688 | -0.513 4.630 0.264 |
## area_mean | 0.805 4.884 0.649 | -0.551 5.340 0.304 |
## smoothness_mean | 0.520 2.033 0.270 | 0.444 3.464 0.197 |
## compactness_mean | 0.872 5.726 0.760 | 0.362 2.307 0.131 |
## concavity_mean | 0.942 6.677 0.887 | 0.144 0.362 0.021 |
## concave_points_mean | 0.951 6.804 0.904 | -0.083 0.121 0.007 |
## symmetry_mean | 0.504 1.909 0.254 | 0.454 3.623 0.206 |
## fractal_dimension_mean | 0.235 0.414 0.055 | 0.875 13.438 0.765 |
## Dim.3 ctr cos2
## radius_mean -0.014 0.007 0.000 |
## texture_mean 0.108 0.417 0.012 |
## perimeter_mean -0.016 0.009 0.000 |
## area_mean 0.048 0.082 0.002 |
## smoothness_mean -0.175 1.088 0.031 |
## compactness_mean -0.124 0.549 0.015 |
## concavity_mean 0.005 0.001 0.000 |
## concave_points_mean -0.043 0.065 0.002 |
## symmetry_mean -0.068 0.162 0.005 |
## fractal_dimension_mean -0.038 0.051 0.001 |
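As a cross-check, the same decomposition is available in base R: `PCA(..., scale.unit = TRUE)` works on the correlation matrix, which corresponds to `prcomp(..., scale. = TRUE)`. A sketch assuming the `df.v` data frame defined above:

```r
# Base-R cross-check of the FactoMineR results: the squared standard
# deviations of the scaled components are the eigenvalues of the
# correlation matrix, i.e. the "Variance" row in the summary above.
pc <- prcomp(df.v, scale. = TRUE)
head(round(pc$sdev^2, 3))  # should reproduce 13.282, 5.691, 2.818, ...
```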
# Extract the eigenvalues of principal components
eig.val <- as.data.frame(get_eigenvalue(df.pca))
subset(eig.val, eigenvalue > 1) # The Kaiser–Harris criterion
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 13.281608 44.272026 44.27203
## Dim.2 5.691355 18.971182 63.24321
## Dim.3 2.817949 9.393163 72.63637
## Dim.4 1.980640 6.602135 79.23851
## Dim.5 1.648731 5.495768 84.73427
## Dim.6 1.207357 4.024522 88.75880
The Kaiser–Harris criterion suggests retaining components with eigenvalues greater than 1 (the cutoff point). The sixth principal component (eigenvalue = 1.207) is the last one above this cutoff, so we stopped there.
fviz_eig(df.pca, addlabels=TRUE, hjust = 0, barfill = "#4189b3", ncp=6) + ylim(0, 50)
In our analysis, the first six principal components explain 88.76% of the dataset variance. The first dimension is associated with the largest eigenvalue, the second dimension with the second-largest eigenvalue, and so on.
head(get_pca_var(df.pca)$cos2)
## Dim.1 Dim.2 Dim.3 Dim.4
## radius_mean 0.6364318 0.31125539 0.0002050963 0.003396209
## texture_mean 0.1428940 0.02028864 0.0117415199 0.720298141
## perimeter_mean 0.6876316 0.26352690 0.0002444703 0.003491038
## area_mean 0.6486576 0.30389811 0.0023210397 0.005655066
## smoothness_mean 0.2700393 0.19713747 0.0306502709 0.050313944
## compactness_mean 0.7604714 0.13130559 0.0154693024 0.002002220
## Dim.5
## radius_mean 0.0023540715
## texture_mean 0.0040347193
## perimeter_mean 0.0023030547
## area_mean 0.0001759769
## smoothness_mean 0.2197586899
## compactness_mean 0.0002258480
par(mfrow=c(1,2))
corrplot(get_pca_var(df.pca)$cos2[1:16,], number.cex = .65, addCoef.col = "black", tl.col = "black", tl.cex = 0.75)
corrplot(get_pca_var(df.pca)$cos2[17:30,], number.cex = .65, addCoef.col = "black", tl.col = "black", tl.cex = 0.75)
For example, in the column labeled Dim.1 (the first PC), 63.64% of the variance in the radius_mean variable is accounted for by Dim.1, while 31.13% is accounted for by Dim.2 (the second PC).
head(get_pca_var(df.pca)$coord)
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## radius_mean 0.7977668 -0.5579027 -0.01432118 -0.05827700 -0.04851878
## texture_mean 0.3780132 -0.1424382 0.10835829 0.84870380 0.06351944
## perimeter_mean 0.8292355 -0.5133487 -0.01563555 -0.05908501 -0.04799015
## area_mean 0.8053928 -0.5512695 0.04817717 -0.07520017 -0.01326563
## smoothness_mean 0.5196530 0.4440017 -0.17507219 -0.22430770 0.46878427
## compactness_mean 0.8720501 0.3623611 -0.12437565 -0.04474618 -0.01502824
The columns contain the component loadings, which are the correlations of the observed variables with the principal components (PCs). radius_mean is strongly positively correlated (0.80) with the first principal component and moderately negatively correlated (−0.56) with the second.
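These loadings are the unit eigenvectors rescaled by the square roots of the eigenvalues, which can be verified with base R. A sketch assuming `df.v` from above (eigenvector signs are arbitrary, so whole columns may come out flipped relative to the FactoMineR output):

```r
# Variable coordinates (loadings) = eigenvector * sqrt(eigenvalue),
# i.e. the correlation of each original variable with each component.
pc <- prcomp(df.v, scale. = TRUE)
loadings <- sweep(pc$rotation, 2, pc$sdev, "*")
round(loadings["radius_mean", 1:2], 2)  # magnitudes ~0.80 and ~0.56
```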
fviz_pca_var(df.pca,labelsize = 3,
col.var = "coord",
gradient.cols = c("#56B4E9", "#fec306", "#df5227"),
repel = TRUE
)
The area_mean, area_worst, radius_mean, radius_worst, perimeter_mean and perimeter_worst are positively correlated, and those 6 metrics contribute the most to the construction of the first principal component (dimension 1).
The fractal_dimension_mean, fractal_dimension_se, fractal_dimension_worst and smoothness_se contribute the most to the second component.
Thus, the first component mainly relates to quantitative geometric measures (area, radius and perimeter), while the second mainly relates to qualitative shape measures (fractal dimensions, smoothness).
head(get_pca_var(df.pca)$contrib)
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## radius_mean 4.791828 5.4689158 0.007278210 0.1714702 0.14278085
## texture_mean 1.075879 0.3564817 0.416669002 36.3669303 0.24471672
## perimeter_mean 5.177322 4.6303018 0.008675469 0.1762581 0.13968654
## area_mean 4.883878 5.3396446 0.082366279 0.2855170 0.01067348
## smoothness_mean 2.033182 3.4638057 1.087680124 2.5402866 13.32896332
## compactness_mean 5.725748 2.3071061 0.548956087 0.1010895 0.01369829
p1 <- fviz_contrib(df.pca, choice = "var", axes = 1, fill="#4189b3", top=15)
p2 <- fviz_contrib(df.pca, choice = "var", axes = 2, fill="#f69400", color="white", top=15)
grid.arrange(p1,p2,ncol=2)
The area_mean contributes 4.88% to the first principal component and 5.33% to the second component. The texture_mean variable contributes 36.36% to the fourth component.
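A variable's contribution is simply its cos2 share of a component, expressed in percent, so the contributions within each column sum to 100. A base-R check (recomputed on mtcars with prcomp for self-containment):

```r
# Base-R sketch: contrib[j, k] = 100 * cos2[j, k] / sum_j cos2[j, k],
# i.e. each component's contributions sum to 100%.
X <- scale(mtcars)
p <- prcomp(X)
cos2 <- sweep(p$rotation, 2, p$sdev, "*")^2          # squared variable coordinates
contrib <- sweep(cos2, 2, colSums(cos2), "/") * 100  # as in get_pca_var(...)$contrib
colSums(contrib)                                     # 100 for every component
```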
fviz_pca_ind(df.pca,
geom.ind = "point",
col.var = "black",
col.ind = df.d$diagnosis,
palette = c("#f69400","#4189b3"),
addEllipses = TRUE,
legend.title = "Diagnosis",
mean.point = FALSE, labelsize = 3, pointsize = 3, pointshape = 20)
The first principal component (dimension 1) captures the main axis of variability between the two groups (benign and malignant).
# Train and Test (Original Data)
set.seed(1234)
training.samples <- df$diagnosis %>%
createDataPartition(p = 0.8, list = FALSE)
df.train <- df[ training.samples,]
df.test <- df[-training.samples,]
# Train and Test (PCA pre-processing)
df.pca2 <- PCA(df.v, scale.unit = TRUE, graph = FALSE, ncp = 6)
set.seed(1234)
df.pca.final <- cbind(df.d, df.pca2$ind$coord)
training.samples.pca <- df.pca.final$diagnosis %>%
createDataPartition(p = 0.8, list = FALSE)
df.train.pca <- df.pca.final[ training.samples.pca,]
df.test.pca <- df.pca.final[-training.samples.pca,]
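createDataPartition draws the 80% sample within each class, so the benign/malignant proportions of the full data carry over to both splits. A base-R sketch of that stratification, using the dataset's known class counts (357 B, 212 M):

```r
# Base-R sketch of a stratified 80/20 split (what createDataPartition does)
set.seed(1234)
y <- factor(rep(c("B", "M"), times = c(357, 212)))  # class counts from the data
train.idx <- unlist(lapply(split(seq_along(y), y),  # sample 80% within each class
                           function(i) sample(i, round(0.8 * length(i)))))
length(train.idx)                # 456 training cases, as in the model output below
prop.table(table(y[train.idx]))  # close to the full-data 357/569 vs 212/569 split
```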
# Original Data
set.seed(1234)
model.knn <- train(
diagnosis ~ ., data = df.train, method = "knn",
trControl = trainControl("cv", number = 10),
preProcess = c("center","scale"),
tuneLength = 20
)
plot(model.knn)
# Prediction Original Data
predicted.classes <- model.knn %>% predict(df.test)
matrix.knn <- confusionMatrix(predicted.classes, df.test$diagnosis)
matrix.knn
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 71 4
## M 0 38
##
## Accuracy : 0.9646
## 95% CI : (0.9118, 0.9903)
## No Information Rate : 0.6283
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9227
##
## Mcnemar's Test P-Value : 0.1336
##
## Sensitivity : 1.0000
## Specificity : 0.9048
## Pos Pred Value : 0.9467
## Neg Pred Value : 1.0000
## Prevalence : 0.6283
## Detection Rate : 0.6283
## Detection Prevalence : 0.6637
## Balanced Accuracy : 0.9524
##
## 'Positive' Class : B
##
# PCA
set.seed(1234)
model.knn.pca <- train(
diagnosis ~ ., data = df.train.pca, method = "knn",
trControl = trainControl("cv", number = 10),
preProcess = c("center","scale"),
tuneLength = 20
)
plot(model.knn.pca)
# Prediction PCA
predicted.classes.pca <- model.knn.pca %>% predict(df.test.pca)
matrix.knn.pca <- confusionMatrix(predicted.classes.pca, df.test.pca$diagnosis)
matrix.knn.pca
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 71 10
## M 0 32
##
## Accuracy : 0.9115
## 95% CI : (0.8433, 0.9567)
## No Information Rate : 0.6283
## P-Value [Acc > NIR] : 6.062e-12
##
## Kappa : 0.8008
##
## Mcnemar's Test P-Value : 0.004427
##
## Sensitivity : 1.0000
## Specificity : 0.7619
## Pos Pred Value : 0.8765
## Neg Pred Value : 1.0000
## Prevalence : 0.6283
## Detection Rate : 0.6283
## Detection Prevalence : 0.7168
## Balanced Accuracy : 0.8810
##
## 'Positive' Class : B
##
# CART Model Original Data
set.seed(1234)
model.tree <- rpart(
diagnosis ~ ., data = df.train, method = "class")
rpart.plot(model.tree, extra=108)
printcp(model.tree)
##
## Classification tree:
## rpart(formula = diagnosis ~ ., data = df.train, method = "class")
##
## Variables actually used in tree construction:
## [1] area_se concave_points_worst perimeter_worst
## [4] texture_mean
##
## Root node error: 170/456 = 0.37281
##
## n= 456
##
## CP nsplit rel error xerror xstd
## 1 0.800000 0 1.000000 1.00000 0.060740
## 2 0.076471 1 0.200000 0.29412 0.039248
## 3 0.017647 2 0.123529 0.18235 0.031619
## 4 0.010000 4 0.088235 0.17059 0.030654
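Instead of hand-picking cp, the one-standard-error rule selects the simplest tree whose cross-validated error is within one SE of the minimum. A base-R sketch applied to the cptable printed above (model.tree$cptable holds the same values):

```r
# 1-SE rule on the cptable printed above (same values as model.tree$cptable)
ct <- cbind(CP     = c(0.800000, 0.076471, 0.017647, 0.010000),
            xerror = c(1.00000,  0.29412,  0.18235,  0.17059),
            xstd   = c(0.060740, 0.039248, 0.031619, 0.030654))
best   <- which.min(ct[, "xerror"])              # row 4: lowest CV error
thresh <- ct[best, "xerror"] + ct[best, "xstd"]  # 0.17059 + 0.030654
cp.1se <- ct[which(ct[, "xerror"] <= thresh)[1], "CP"]
cp.1se                                           # 0.017647: a two-split tree
# prune(model.tree, cp = cp.1se) would then keep only the first two splits
```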
rpart.rules(model.tree, extra=108)
## diagnosis
## 0.90 when perimeter_worst < 115 & concave_points_worst < 0.16 & area_se >= 33 & texture_mean < 21
## 0.98 when perimeter_worst < 115 & concave_points_worst < 0.16 & area_se < 33
## 0.75 when perimeter_worst < 115 & concave_points_worst < 0.16 & area_se >= 33 & texture_mean >= 21
## 0.88 when perimeter_worst < 115 & concave_points_worst >= 0.16
## 0.98 when perimeter_worst >= 115
# Prediction
predicted.classes <- model.tree %>% predict(df.test, type = "class")
matrix.tree <- confusionMatrix(predicted.classes, df.test$diagnosis)
matrix.tree
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 68 7
## M 3 35
##
## Accuracy : 0.9115
## 95% CI : (0.8433, 0.9567)
## No Information Rate : 0.6283
## P-Value [Acc > NIR] : 6.062e-12
##
## Kappa : 0.8068
##
## Mcnemar's Test P-Value : 0.3428
##
## Sensitivity : 0.9577
## Specificity : 0.8333
## Pos Pred Value : 0.9067
## Neg Pred Value : 0.9211
## Prevalence : 0.6283
## Detection Rate : 0.6018
## Detection Prevalence : 0.6637
## Balanced Accuracy : 0.8955
##
## 'Positive' Class : B
##
#Pruning
model.tree.p <- prune(model.tree, cp=.011765)
rpart.plot(model.tree.p, extra=108)
rpart.rules(model.tree.p, extra=108)
## diagnosis
## 0.90 when perimeter_worst < 115 & concave_points_worst < 0.16 & area_se >= 33 & texture_mean < 21
## 0.98 when perimeter_worst < 115 & concave_points_worst < 0.16 & area_se < 33
## 0.75 when perimeter_worst < 115 & concave_points_worst < 0.16 & area_se >= 33 & texture_mean >= 21
## 0.88 when perimeter_worst < 115 & concave_points_worst >= 0.16
## 0.98 when perimeter_worst >= 115
# Prediction post-pruning
predicted.classes <- model.tree.p %>% predict(df.test, type = "class")
matrix.tree.p <- confusionMatrix(predicted.classes, df.test$diagnosis)
matrix.tree.p
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 68 7
## M 3 35
##
## Accuracy : 0.9115
## 95% CI : (0.8433, 0.9567)
## No Information Rate : 0.6283
## P-Value [Acc > NIR] : 6.062e-12
##
## Kappa : 0.8068
##
## Mcnemar's Test P-Value : 0.3428
##
## Sensitivity : 0.9577
## Specificity : 0.8333
## Pos Pred Value : 0.9067
## Neg Pred Value : 0.9211
## Prevalence : 0.6283
## Detection Rate : 0.6018
## Detection Prevalence : 0.6637
## Balanced Accuracy : 0.8955
##
## 'Positive' Class : B
##
# PCA
set.seed(1234)
model.tree.pca <- rpart(
diagnosis ~ ., data = df.train.pca, method = "class")
rpart.plot(model.tree.pca, extra=108)
printcp(model.tree.pca)
##
## Classification tree:
## rpart(formula = diagnosis ~ ., data = df.train.pca, method = "class")
##
## Variables actually used in tree construction:
## [1] Dim.1 Dim.2 Dim.3 Dim.5
##
## Root node error: 170/456 = 0.37281
##
## n= 456
##
## CP nsplit rel error xerror xstd
## 1 0.776471 0 1.000000 1.00000 0.060740
## 2 0.041176 1 0.223529 0.30588 0.039926
## 3 0.035294 3 0.141176 0.26471 0.037462
## 4 0.011765 4 0.105882 0.20588 0.033438
## 5 0.010000 5 0.094118 0.20000 0.032996
rpart.rules(model.tree.pca, extra=108)
## diagnosis
## 0.88 when Dim.1 >= 1.2 & Dim.5 < -2
## 0.93 when Dim.1 is -1.0 to 1.2 & Dim.2 >= -1.3 & Dim.3 >= -2
## 0.99 when Dim.1 < -1.0
## 0.62 when Dim.1 is -1.0 to 1.2 & Dim.2 >= -1.3 & Dim.3 < -2
## 0.85 when Dim.1 is -1.0 to 1.2 & Dim.2 < -1.3
## 0.98 when Dim.1 >= 1.2 & Dim.5 >= -2
# Prediction PCA
predicted.classes.pca <- model.tree.pca %>% predict(df.test.pca, type = "class")
matrix.tree.pca <- confusionMatrix(predicted.classes.pca, df.test.pca$diagnosis)
matrix.tree.pca
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 68 8
## M 3 34
##
## Accuracy : 0.9027
## 95% CI : (0.8325, 0.9504)
## No Information Rate : 0.6283
## P-Value [Acc > NIR] : 3.429e-11
##
## Kappa : 0.7864
##
## Mcnemar's Test P-Value : 0.2278
##
## Sensitivity : 0.9577
## Specificity : 0.8095
## Pos Pred Value : 0.8947
## Neg Pred Value : 0.9189
## Prevalence : 0.6283
## Detection Rate : 0.6018
## Detection Prevalence : 0.6726
## Balanced Accuracy : 0.8836
##
## 'Positive' Class : B
##
# Random Forest Model (All variables) Original Data
set.seed(1234)
model.rf <- train(
diagnosis ~ ., data = df.train, method = "rf",
trControl = trainControl("cv", number = 10),
importance = FALSE
)
model.rf$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, importance = FALSE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 30
##
## OOB estimate of error rate: 4.39%
## Confusion matrix:
## B M class.error
## B 278 8 0.02797203
## M 12 158 0.07058824
# Plot MeanDecreaseGini
varImpPlot(model.rf$finalModel, type = 2)
varImp(model.rf)
## rf variable importance
##
## only 20 most important variables shown (out of 30)
##
## Overall
## perimeter_worst 100.0000
## concave_points_worst 86.6696
## area_worst 30.1371
## concave_points_mean 24.6583
## radius_worst 14.9139
## texture_worst 5.9261
## texture_mean 5.4653
## area_se 4.8550
## concavity_worst 2.8408
## concavity_mean 2.0266
## smoothness_worst 1.5420
## compactness_worst 1.3437
## area_mean 0.9296
## fractal_dimension_se 0.6642
## symmetry_worst 0.6553
## fractal_dimension_worst 0.6492
## symmetry_mean 0.6414
## radius_se 0.5587
## concave_points_se 0.4607
## texture_se 0.4460
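Note that varImp rescales the raw importance scores (here MeanDecreaseGini) to a 0-100 range, so the values above are relative, not absolute. A base-R sketch of that rescaling, with hypothetical raw values:

```r
# Hypothetical raw MeanDecreaseGini values, to illustrate varImp's 0-100 rescaling
raw <- c(perimeter_worst = 52.1, concave_points_worst = 30.4,
         area_worst = 16.0, radius_worst = 8.1, concave_points_mean = 1.6)
scaled <- 100 * (raw - min(raw)) / (max(raw) - min(raw))
round(scaled, 2)  # the largest value maps to 100, the smallest to 0
```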
# Prediction
predicted.classes <- model.rf %>% predict(df.test)
matrix.rf <- confusionMatrix(predicted.classes, df.test$diagnosis)
matrix.rf
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 69 2
## M 2 40
##
## Accuracy : 0.9646
## 95% CI : (0.9118, 0.9903)
## No Information Rate : 0.6283
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9242
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9718
## Specificity : 0.9524
## Pos Pred Value : 0.9718
## Neg Pred Value : 0.9524
## Prevalence : 0.6283
## Detection Rate : 0.6106
## Detection Prevalence : 0.6283
## Balanced Accuracy : 0.9621
##
## 'Positive' Class : B
##
# Random Forest Model (Top 5 most important variables) Original Data
set.seed(1234)
model.rf2 <- train(
diagnosis ~ perimeter_worst + radius_worst + concave_points_worst + area_worst + concave_points_mean, data = df.train, method = "rf",
trControl = trainControl("cv", number = 10),
importance = FALSE
)
model.rf2$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, importance = FALSE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 6.58%
## Confusion matrix:
## B M class.error
## B 273 13 0.04545455
## M 17 153 0.10000000
# Plot MeanDecreaseGini
varImpPlot(model.rf2$finalModel, type = 2)
varImp(model.rf2)
## rf variable importance
##
## Overall
## perimeter_worst 100.000
## concave_points_worst 57.028
## area_worst 28.657
## radius_worst 8.771
## concave_points_mean 0.000
# Prediction
predicted.classes <- model.rf2 %>% predict(df.test)
matrix.rf2 <- confusionMatrix(predicted.classes, df.test$diagnosis)
matrix.rf2
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 68 4
## M 3 38
##
## Accuracy : 0.9381
## 95% CI : (0.8765, 0.9747)
## No Information Rate : 0.6283
## P-Value [Acc > NIR] : 1.718e-14
##
## Kappa : 0.8667
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9577
## Specificity : 0.9048
## Pos Pred Value : 0.9444
## Neg Pred Value : 0.9268
## Prevalence : 0.6283
## Detection Rate : 0.6018
## Detection Prevalence : 0.6372
## Balanced Accuracy : 0.9313
##
## 'Positive' Class : B
##
# PCA
set.seed(1234)
model.rf.pca <- train(
diagnosis ~ ., data = df.train.pca, method = "rf",
trControl = trainControl("cv", number = 10),
importance = FALSE
)
model.rf.pca$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, importance = FALSE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 4.39%
## Confusion matrix:
## B M class.error
## B 278 8 0.02797203
## M 12 158 0.07058824
# Plot MeanDecreaseGini
varImpPlot(model.rf.pca$finalModel, type = 2)
varImp(model.rf.pca)
## rf variable importance
##
## Overall
## Dim.1 100.0000
## Dim.2 14.4252
## Dim.3 10.1429
## Dim.5 2.3204
## Dim.4 0.3595
## Dim.6 0.0000
# Prediction PCA
predicted.classes.pca <- model.rf.pca %>% predict(df.test.pca)
matrix.rf.pca <- confusionMatrix(predicted.classes.pca, df.test.pca$diagnosis)
matrix.rf.pca
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 69 3
## M 2 39
##
## Accuracy : 0.9558
## 95% CI : (0.8998, 0.9855)
## No Information Rate : 0.6283
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9048
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9718
## Specificity : 0.9286
## Pos Pred Value : 0.9583
## Neg Pred Value : 0.9512
## Prevalence : 0.6283
## Detection Rate : 0.6106
## Detection Prevalence : 0.6372
## Balanced Accuracy : 0.9502
##
## 'Positive' Class : B
##
model.ml.pca <- train(diagnosis ~., data = df.train.pca, method = "glm")
summary(model.ml.pca)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6870 -0.0488 -0.0042 0.0006 3.5749
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.6840 0.3626 -1.886 0.059244 .
## Dim.1 2.8572 0.4982 5.735 9.78e-09 ***
## Dim.2 -1.8064 0.3619 -4.991 6.00e-07 ***
## Dim.3 -0.8409 0.3055 -2.752 0.005915 **
## Dim.4 0.7382 0.2432 3.036 0.002400 **
## Dim.5 1.7634 0.5349 3.297 0.000978 ***
## Dim.6 0.4458 0.3335 1.337 0.181296
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 602.315 on 455 degrees of freedom
## Residual deviance: 63.232 on 449 degrees of freedom
## AIC: 77.232
##
## Number of Fisher Scoring iterations: 10
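On the logit scale the coefficients above are hard to read directly; exponentiating them gives odds ratios (the glm models the probability of the second factor level, M). A quick base-R conversion of the printed estimates:

```r
# Odds ratios from the logistic coefficients above (the model predicts P(M),
# the second factor level of diagnosis)
coefs <- c(Dim.1 = 2.8572, Dim.2 = -1.8064, Dim.3 = -0.8409,
           Dim.4 = 0.7382, Dim.5 = 1.7634, Dim.6 = 0.4458)
round(exp(coefs), 2)
# e.g. a one-unit increase along Dim.1 multiplies the odds of malignancy by ~17.4
```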
predicted.classes.pca <- model.ml.pca %>% predict(df.test.pca)
matrix.ml.pca <- confusionMatrix(predicted.classes.pca, df.test.pca$diagnosis)
matrix.ml.pca
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 70 3
## M 1 39
##
## Accuracy : 0.9646
## 95% CI : (0.9118, 0.9903)
## No Information Rate : 0.6283
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9235
##
## Mcnemar's Test P-Value : 0.6171
##
## Sensitivity : 0.9859
## Specificity : 0.9286
## Pos Pred Value : 0.9589
## Neg Pred Value : 0.9750
## Prevalence : 0.6283
## Detection Rate : 0.6195
## Detection Prevalence : 0.6460
## Balanced Accuracy : 0.9572
##
## 'Positive' Class : B
##
# Support Vector Machine Original Data
set.seed(1234)
model.svm <- train(
diagnosis ~., data = df.train, method = "svmLinear",
trControl = trainControl("cv", number = 10),
preProcess = c("center","scale")
)
predicted.classes <- model.svm %>% predict(df.test)
matrix.svm <- confusionMatrix(predicted.classes, df.test$diagnosis)
matrix.svm
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 71 4
## M 0 38
##
## Accuracy : 0.9646
## 95% CI : (0.9118, 0.9903)
## No Information Rate : 0.6283
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9227
##
## Mcnemar's Test P-Value : 0.1336
##
## Sensitivity : 1.0000
## Specificity : 0.9048
## Pos Pred Value : 0.9467
## Neg Pred Value : 1.0000
## Prevalence : 0.6283
## Detection Rate : 0.6283
## Detection Prevalence : 0.6637
## Balanced Accuracy : 0.9524
##
## 'Positive' Class : B
##
# PCA
set.seed(1234)
model.svm.pca <- train(
diagnosis ~., data = df.train.pca, method = "svmLinear",
trControl = trainControl("cv", number = 10),
preProcess = c("center","scale")
)
predicted.classes.pca <- model.svm.pca %>% predict(df.test.pca)
matrix.svm.pca <- confusionMatrix(predicted.classes.pca, df.test.pca$diagnosis)
matrix.svm.pca
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 71 3
## M 0 39
##
## Accuracy : 0.9735
## 95% CI : (0.9244, 0.9945)
## No Information Rate : 0.6283
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9423
##
## Mcnemar's Test P-Value : 0.2482
##
## Sensitivity : 1.0000
## Specificity : 0.9286
## Pos Pred Value : 0.9595
## Neg Pred Value : 1.0000
## Prevalence : 0.6283
## Detection Rate : 0.6283
## Detection Prevalence : 0.6549
## Balanced Accuracy : 0.9643
##
## 'Positive' Class : B
##
#KNN
knn.1 <- as.data.frame(matrix.knn$overall["Accuracy"])
colnames(knn.1) <- ""
knn.2 <- as.data.frame(matrix.knn$byClass[1:4])
colnames(knn.2) <- ""
knn <- rbind(knn.1, knn.2)
colnames(knn) <- "KNN"
knn.pca1 <- as.data.frame(matrix.knn.pca$overall["Accuracy"])
colnames(knn.pca1) <- ""
knn.pca2 <- as.data.frame(matrix.knn.pca$byClass[1:4])
colnames(knn.pca2) <- ""
knn.pca <- rbind(knn.pca1, knn.pca2)
colnames(knn.pca) <- "KNN PCA"
row.names(knn.pca) <- c()
# CART
tree.1 <- as.data.frame(matrix.tree$overall["Accuracy"])
colnames(tree.1) <- ""
tree.2 <- as.data.frame(matrix.tree$byClass[1:4])
colnames(tree.2) <- ""
tree <- rbind(tree.1, tree.2)
colnames(tree) <- "CART"
row.names(tree) <- c()
tree.p.1 <- as.data.frame(matrix.tree.p$overall["Accuracy"])
colnames(tree.p.1) <- ""
tree.p.2 <- as.data.frame(matrix.tree.p$byClass[1:4])
colnames(tree.p.2) <- ""
tree.p <- rbind(tree.p.1, tree.p.2)
colnames(tree.p) <- "CART Pruned"
row.names(tree.p) <- c()
tree.pca1 <- as.data.frame(matrix.tree.pca$overall["Accuracy"])
colnames(tree.pca1) <- ""
tree.pca2 <- as.data.frame(matrix.tree.pca$byClass[1:4])
colnames(tree.pca2) <- ""
tree.pca <- rbind(tree.pca1, tree.pca2)
colnames(tree.pca) <- "CART PCA"
row.names(tree.pca) <- c()
#RF
rf.1 <- as.data.frame(matrix.rf$overall["Accuracy"])
colnames(rf.1) <- ""
rf.2 <- as.data.frame(matrix.rf$byClass[1:4])
colnames(rf.2) <- ""
rf <- rbind(rf.1, rf.2)
colnames(rf) <- "RF"
row.names(rf) <- c()
rf2.1 <- as.data.frame(matrix.rf2$overall["Accuracy"])
colnames(rf2.1) <- ""
rf2.2 <- as.data.frame(matrix.rf2$byClass[1:4])
colnames(rf2.2) <- ""
rf2 <- rbind(rf2.1, rf2.2)
colnames(rf2) <- "RF (TOP 5)"
row.names(rf2) <- c()
rf.pca1 <- as.data.frame(matrix.rf.pca$overall["Accuracy"])
colnames(rf.pca1) <- ""
rf.pca2 <- as.data.frame(matrix.rf.pca$byClass[1:4])
colnames(rf.pca2) <- ""
rf.pca <- rbind(rf.pca1, rf.pca2)
colnames(rf.pca) <- "RF PCA"
row.names(rf.pca) <- c()
# Logit
ml.pca1 <- as.data.frame(matrix.ml.pca$overall["Accuracy"])
colnames(ml.pca1) <- ""
ml.pca2 <- as.data.frame(matrix.ml.pca$byClass[1:4])
colnames(ml.pca2) <- ""
ml.pca <- rbind(ml.pca1, ml.pca2)
colnames(ml.pca) <- "Logit PCA"
row.names(ml.pca) <- c()
#SVM
svm.1 <- as.data.frame(matrix.svm$overall["Accuracy"])
colnames(svm.1) <- ""
svm.2 <- as.data.frame(matrix.svm$byClass[1:4])
colnames(svm.2) <- ""
svm <- rbind(svm.1, svm.2)
colnames(svm) <- "SVM"
row.names(svm) <- c()
svm.pca1 <- as.data.frame(matrix.svm.pca$overall["Accuracy"])
colnames(svm.pca1) <- ""
svm.pca2 <- as.data.frame(matrix.svm.pca$byClass[1:4])
colnames(svm.pca2) <- ""
svm.pca <- rbind(svm.pca1, svm.pca2)
colnames(svm.pca) <- "SVM PCA"
row.names(svm.pca) <- c()
final <- as.data.frame(t(cbind(knn, knn.pca, tree, tree.p, tree.pca, rf, rf2, rf.pca, ml.pca, svm, svm.pca)))
as.datatable(formattable(final, list(
Accuracy = color_tile("#e6ad9c","#df5227"),
Sensitivity = color_tile("#b5d7eb","#56B4E9"),
Specificity = color_tile("#b5d7eb","#56B4E9"),
`Pos Pred Value` = color_tile("#f2dfa2","#fec306"),
`Neg Pred Value` = color_tile("#f2dfa2","#fec306")
)), options = list(pageLength =11, dom = 'tip'))
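The repeated extract-rename-bind blocks above could be collapsed into a single helper. A sketch (collect_metrics is a hypothetical name; it assumes caret confusionMatrix objects like those built earlier):

```r
# Sketch: one helper replaces the repeated Accuracy/byClass extraction
collect_metrics <- function(cm, label) {
  # cm$overall["Accuracy"] plus the first four byClass metrics
  # (Sensitivity, Specificity, Pos Pred Value, Neg Pred Value)
  out <- data.frame(c(cm$overall["Accuracy"], cm$byClass[1:4]))
  colnames(out) <- label
  out
}
# final <- as.data.frame(t(cbind(
#   collect_metrics(matrix.knn, "KNN"),
#   collect_metrics(matrix.knn.pca, "KNN PCA"),
#   ...)))  # and so on for the remaining models
```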
The Support Vector Machine model with PCA pre-processing performed best, with 97.35% accuracy.
The Random Forest, logistic regression with PCA pre-processing, KNN and SVM models also achieved very good accuracy, 96.46%.
Overall, the models achieved accuracies ranging from 90.27% to 97.35%, with a good balance of sensitivity and specificity (i.e., similar levels of performance on both classes).
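These metrics follow directly from the confusion matrices; for example, for the SVM with PCA pre-processing (positive class B, so true positives are correctly predicted benign cases):

```r
# Metrics recomputed from the SVM + PCA confusion matrix (positive class = B)
TP <- 71; FN <- 0  # benign cases:    correctly classified / misclassified as M
FP <- 3;  TN <- 39 # malignant cases: misclassified as B / correctly classified
sensitivity <- TP / (TP + FN)                   # 1.0000
specificity <- TN / (TN + FP)                   # 0.9286
accuracy    <- (TP + TN) / (TP + FN + FP + TN)  # 0.9735
```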
Finally, taking into account the analysis conducted, we would like to point out that the following cell nucleus characteristics seem to be the most relevant for diagnosing breast cancer through the fine needle aspiration (FNA) procedure: perimeter, concave points, area and radius (particularly their "worst" values).
We hope you enjoyed this kernel.
If you have any questions or suggestions about this project, we would appreciate hearing them.